
Showing papers by "Andrew Zisserman published in 2008"


Proceedings ArticleDOI
16 Dec 2008
TL;DR: Results show that learning the optimum kernel combination of multiple features vastly improves the performance, from 55.1% for the best single feature to 72.8% for the combination of all features.
Abstract: We investigate to what extent combinations of features can improve classification performance on a large dataset of similar classes. To this end we introduce a 103 class flower dataset. We compute four different features for the flowers, each describing different aspects, namely the local shape/texture, the shape of the boundary, the overall spatial distribution of petals, and the colour. We combine the features using a multiple kernel framework with an SVM classifier. The weights for each class are learnt using the method of Varma and Ray, which has achieved state-of-the-art performance on other large datasets, such as Caltech 101/256. Our dataset has a similar challenge in the number of classes, but with the added difficulty of large between class similarity and small within class similarity. Results show that learning the optimum kernel combination of multiple features vastly improves the performance, from 55.1% for the best single feature to 72.8% for the combination of all features.
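
The kernel-combination idea above is easy to prototype. Below is a minimal, illustrative sketch (not the paper's implementation): one kernel per feature channel, blended with fixed weights standing in for the per-class weights that the Varma-Ray method would learn, and fed to an SVM with a precomputed kernel. All data, weights, and kernel parameters are made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for per-feature descriptors (e.g. shape, boundary, colour).
n = 60
y = rng.integers(0, 3, n)                      # 3 toy classes
feats = [rng.normal(size=(n, d)) + y[:, None] for d in (32, 16, 8)]

def rbf(X, Z, gamma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# One kernel per feature channel; the weights would be learned (per class)
# in the Varma-Ray formulation -- here they are fixed for illustration.
betas = [0.5, 0.3, 0.2]
K = sum(b * rbf(X, X, 1.0 / X.shape[1]) for b, X in zip(betas, feats))

clf = SVC(kernel="precomputed").fit(K, y)
print("train accuracy:", clf.score(K, y))
```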

2,619 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: Each visual region is mapped to a weighted set of visual words selected by proximity in descriptor space; the paper describes how this representation is incorporated into a standard tf-idf architecture and how spatial verification is modified to handle the soft-assignment.
Abstract: The state of the art in visual object retrieval from large databases is achieved by systems that are inspired by text retrieval. A key component of these approaches is that local regions of images are characterized using high-dimensional descriptors which are then mapped to “visual words” selected from a discrete vocabulary. This paper explores techniques to map each visual region to a weighted set of words, allowing the inclusion of features which were lost in the quantization stage of previous systems. The set of visual words is obtained by selecting words based on proximity in descriptor space. We describe how this representation may be incorporated into a standard tf-idf architecture, and how spatial verification is modified in the case of this soft-assignment. We evaluate our method on the standard Oxford Buildings dataset, and introduce a new dataset for evaluation. Our results exceed the current state of the art retrieval performance on these datasets, particularly on queries with poor initial recall where techniques like query expansion suffer. Overall we show that soft-assignment is always beneficial for retrieval with large vocabularies, at a cost of increased storage requirements for the index.
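
A minimal sketch of the soft-assignment step described above, assuming the paper's exp(-d²/2σ²) weighting over the k nearest vocabulary words. The vocabulary, descriptors, and parameter values (k, σ) here are toy placeholders; the resulting soft histogram is what a tf-idf index would consume.

```python
import numpy as np

def soft_assign(descs, vocab, k=3, sigma=1.0):
    """Map each descriptor to its k nearest visual words, weighted by
    exp(-d^2 / (2 sigma^2)) and normalised, then accumulate a soft
    term-frequency histogram over the vocabulary."""
    d2 = ((descs[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    d2k = np.take_along_axis(d2, nearest, axis=1)
    w = np.exp(-(d2k - d2k[:, :1]) / (2 * sigma ** 2))  # stabilised weights
    w /= w.sum(axis=1, keepdims=True)                   # normalise per descriptor
    hist = np.zeros(len(vocab))
    np.add.at(hist, nearest.ravel(), w.ravel())         # soft term counts
    return hist

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 128))      # toy vocabulary of 1000 words
descs = rng.normal(size=(50, 128))        # toy SIFT-like descriptors
print(soft_assign(descs, vocab).sum())    # == number of descriptors (50)
```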

1,630 citations


Proceedings Article
08 Dec 2008
TL;DR: A novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models is proposed, with results on standard handwritten digit and texture classification tasks.
Abstract: It is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks.
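
The moving parts can be sketched as follows, with one important simplification: here a shared dictionary is learned purely reconstructively (via scikit-learn) and the class-decision functions are fit on the sparse codes afterwards, whereas the paper optimizes both jointly. Data and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy signals from two classes with different structure.
n, dim = 200, 20
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, dim)) + 2.0 * y[:, None] * np.sin(np.arange(dim))

# Shared dictionary, learned reconstructively (the paper learns the
# dictionary and the class models jointly; this decoupled version only
# shows the components of such a model).
dico = DictionaryLearning(n_components=15, alpha=0.5, max_iter=20,
                          transform_algorithm="lasso_lars",
                          transform_alpha=0.5, random_state=0)
codes = dico.fit_transform(X)              # sparse codes for each signal

# Linear class-decision functions on the sparse codes.
clf = LogisticRegression(max_iter=1000).fit(codes, y)
print("train accuracy on codes:", clf.score(codes, y))
```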

1,108 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: This article proposes an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning, for local image discrimination tasks, and paves the way for a novel scene analysis and recognition framework based on simultaneously learning discriminative and reconstructive dictionaries.
Abstract: Sparse signal models have been the focus of much recent research, leading to (or improving upon) state-of-the-art results in signal, image, and video restoration. This article extends this line of research into a novel framework for local image discrimination tasks, proposing an energy formulation with both sparse reconstruction and class discrimination components, jointly optimized during dictionary learning. This approach improves over the state of the art in texture segmentation experiments using the Brodatz database, and it paves the way for a novel scene analysis and recognition framework based on simultaneously learning discriminative and reconstructive dictionaries. Preliminary results in this direction using examples from the Pascal VOC06 and Graz02 datasets are presented as well.

828 citations


Journal ArticleDOI
TL;DR: This work introduces a novel vocabulary using dense color SIFT descriptors and investigates the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM).
Abstract: We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent “topics” using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual word representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
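
For concreteness, a minimal EM implementation of pLSA in the spirit of the paper's first stage is sketched below; images play the role of documents and quantized descriptors the role of words. The toy counts matrix is a placeholder, and the resulting per-image topic distribution p(z|d) is what the multiway classifier would be trained on.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal EM for pLSA on a documents-x-words count matrix.
    Returns p(w|z) and p(z|d)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w) for every (doc, word) pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from expected counts.
        ec = counts[:, None, :] * resp
        p_w_z = ec.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = ec.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

rng = np.random.default_rng(1)
counts = rng.poisson(1.0, size=(30, 200))   # 30 toy "images", 200 visual words
p_w_z, p_z_d = plsa(counts, n_topics=5)
print(p_z_d[0])   # topic distribution vector: the classifier's input
```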

778 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: An approach that progressively reduces the search space for body parts, to greatly improve the chances that pose estimation will succeed, and an integrated spatio-temporal model covering multiple frames to refine pose estimates from individual frames, with inference using belief propagation.
Abstract: The objective of this paper is to estimate 2D human pose as a spatial configuration of body parts in TV and movie video shots. Such video material is uncontrolled and extremely challenging. We propose an approach that progressively reduces the search space for body parts, to greatly improve the chances that pose estimation will succeed. This involves two contributions: (i) a generic detector using a weak model of pose to substantially reduce the full pose search space; and (ii) employing 'grabcut' initialized on detected regions proposed by the weak model, to further prune the search space. Moreover, we also propose (iii) an integrated spatio-temporal model covering multiple frames to refine pose estimates from individual frames, with inference using belief propagation. The method is fully automatic and self-initializing, and explains the spatio-temporal volume covered by a person moving in a shot, by soft-labeling every pixel as belonging to a particular body part or to the background. We demonstrate upper-body pose estimation by an extensive evaluation over 70,000 frames from four episodes of the TV series Buffy the Vampire Slayer, and present an application to full-body action recognition on the Weizmann dataset.
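
Contribution (ii) can be illustrated in isolation: initialize grabcut from a detection window and keep the (probable) foreground pixels as a pruned search region. The sketch below uses OpenCV's grabCut on a synthetic image; the rectangle stands in for the weak-model detection, and nothing else of the paper's pipeline is reproduced.

```python
import numpy as np
import cv2  # OpenCV

# A toy image; in the paper's setting this would be a video frame and
# `rect` would come from the weak upper-body detector.
img = np.full((240, 320, 3), 255, np.uint8)
cv2.circle(img, (160, 120), 60, (30, 30, 200), -1)   # stand-in "person"
rect = (80, 40, 160, 160)                            # x, y, w, h from a detector

mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)
fgd = np.zeros((1, 65), np.float64)
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked (probable) foreground restrict the later pose search.
fg = np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD))
print("foreground fraction:", fg.mean())
```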

732 citations


Posted Content
TL;DR: In this article, a sparse representation for signals belonging to different classes in terms of a shared dictionary and multiple class-decision functions is proposed, and an optimization framework for learning all the components of the proposed model is presented.
Abstract: It is now well established that sparse signal models are well suited to restoration tasks and can effectively be learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and multiple class-decision functions. The linear variant of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks.

586 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing and an efficient way of exploiting more sophisticated similarity measures that have proven to be essential in image / particular object retrieval.
Abstract: This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing. The similarity measures are applied and evaluated in the context of near duplicate image detection. The proposed method uses a visual vocabulary of vector quantized local feature descriptors (SIFT) and for retrieval exploits enhanced min-Hash techniques. Standard min-Hash uses an approximate set intersection between document descriptors as a similarity measure. We propose an efficient way of exploiting more sophisticated similarity measures that have proven to be essential in image / particular object retrieval. The proposed similarity measures do not require extra computational effort compared to the original measure. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. The method requires only a small amount of data to be stored for each image. We demonstrate our method on the TRECVID 2006 data set, which contains approximately 146K key frames, and also on the challenging University of Kentucky image retrieval database.
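
As background to the entry above, standard min-Hash can be sketched in a few lines: each image is a set of visual-word ids, and the probability that two sketches agree at a coordinate equals the set-overlap (Jaccard) similarity. The paper's contribution is stronger similarity measures at the same cost; this sketch shows only the baseline, with toy sets and made-up hash parameters.

```python
import numpy as np

def minhash_sketch(word_set, n_hashes=64, prime=2_147_483_647, seed=0):
    """Standard min-Hash: per hash function, keep the minimum hashed
    visual-word id of the set."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, prime, n_hashes)
    b = rng.integers(0, prime, n_hashes)
    ids = np.fromiter(word_set, dtype=np.int64)
    return ((a[:, None] * ids[None, :] + b[:, None]) % prime).min(axis=1)

def estimated_similarity(s1, s2):
    # Fraction of agreeing coordinates estimates |A n B| / |A u B|.
    return float(np.mean(s1 == s2))

A = set(range(0, 600))                # toy visual-word sets of two images
B = set(range(100, 700))
sA, sB = minhash_sketch(A), minhash_sketch(B)
print("exact Jaccard:", len(A & B) / len(A | B))
print("min-Hash estimate:", estimated_similarity(sA, sB))
```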

515 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: It is shown that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously.
Abstract: This work investigates the use of Random Forests for class based pixel-wise segmentation of images. The contribution of this paper is three-fold. First, we show that apparently quite dissimilar classifiers (such as nearest neighbour matching to texton class histograms) can be mapped onto a Random Forest architecture. Second, based on this insight, we show that the performance of such classifiers can be improved by incorporating the spatial context and discriminative learning that arises naturally in the Random Forest framework. Finally, we show that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously. The benefit of the multi-feature classifier is demonstrated with extensive experimentation on existing labelled image datasets. The method equals or exceeds the state of the art on these datasets.
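
A minimal, hypothetical version of the multi-feature pixel classifier: per-pixel feature channels are concatenated and a Random Forest is trained on them, letting the trees select whichever channels discriminate at each split. The "features" below are random stand-ins for the texton, colour, filterbank, and HOG responses the paper combines.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy per-pixel features: rows are pixels, columns are feature channels.
n_pixels = 5000
colour = rng.normal(size=(n_pixels, 3))
texture = rng.normal(size=(n_pixels, 8))
labels = (colour[:, 0] + texture[:, 0] > 0).astype(int)  # toy ground truth

# Random Forests combine heterogeneous channels by simple concatenation.
X = np.hstack([colour, texture])
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print("pixel accuracy:", forest.score(X, labels))
```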

257 citations


Proceedings ArticleDOI
23 Jun 2008
TL;DR: This work proposes to group visual objects using a multi-layer hierarchy tree that is based on common visual elements by adapting to the visual domain the generative hierarchical latent Dirichlet allocation (hLDA) model previously used for unsupervised discovery of topic hierarchies in text.
Abstract: Objects in the world can be arranged into a hierarchy based on their semantic meaning (e.g. organism - animal - feline - cat). What about defining a hierarchy based on the visual appearance of objects? This paper investigates ways to automatically discover a hierarchical structure for the visual world from a collection of unlabeled images. Previous approaches for unsupervised object and scene discovery focused on partitioning the visual data into a set of non-overlapping classes of equal granularity. In this work, we propose to group visual objects using a multi-layer hierarchy tree that is based on common visual elements. This is achieved by adapting to the visual domain the generative hierarchical latent Dirichlet allocation (hLDA) model previously used for unsupervised discovery of topic hierarchies in text. Images are modeled using quantized local image regions as analogues to words in text. Employing the multiple segmentation framework of Russell et al. [22], we show that meaningful object hierarchies, together with object segmentations, can be automatically learned from unlabeled and unsegmented image collections without supervision. We demonstrate improved object classification and localization performance using hLDA over the previous non-hierarchical method on the MSRC dataset [33].

255 citations


Journal ArticleDOI
TL;DR: An unsupervised approach for learning a layered representation of a scene from video for motion segmentation; the method is applicable to any video containing piecewise parametric motion and refines its estimates using αβ-swap and α-expansion algorithms.
Abstract: We present an unsupervised approach for learning a layered representation of a scene from a video for motion segmentation. Our method is applicable to any video containing piecewise parametric motion. The learnt model is a composition of layers, which consist of one or more segments. The shape of each segment is represented using a binary matte and its appearance is given by the RGB value for each point belonging to the matte. Included in the model are the effects of image projection, lighting, and motion blur. Furthermore, spatial continuity is explicitly modeled, resulting in contiguous segments. Unlike previous approaches, our method does not use reference frame(s) for initialization. The two main contributions of our method are: (i) a novel algorithm for obtaining the initial estimate of the model by dividing the scene into rigidly moving components using efficient loopy belief propagation; and (ii) refining the initial estimate using αβ-swap and α-expansion algorithms, which guarantee a strong local minimum. Results are presented on several classes of objects with different types of camera motion, e.g. videos of a human walking shot with static or translating cameras. We compare our method with the state of the art and demonstrate significant improvements.

Proceedings ArticleDOI
01 Sep 2008
TL;DR: The goal of this work is to detect hand and arm positions over continuous sign language video sequences of more than one hour in length and it is shown that the method is able to identify the true arm and hand locations.
Abstract: The goal of this work is to detect hand and arm positions over continuous sign language video sequences of more than one hour in length. We cast the problem as inference in a generative model of the image. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) using efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large set of frames where correct configurations can be inferred, and using temporal tracking elsewhere. Results are reported for signing footage with changing background, challenging image conditions, and different signers; and we show that the method is able to identify the true arm and hand locations. The results exceed the state-of-the-art for the length and stability of continuous limb tracking.

Journal ArticleDOI
01 Oct 2008
TL;DR: A novel algorithmic approach to object categorization and detection that can learn category specific detectors, using Boosting, from a visual alphabet of shape and appearance, and shows that incremental learning of a BFM for many categories leads to a sub-linear growth of visual alphabet entries by sharing of shape features.
Abstract: We present a novel algorithmic approach to object categorization and detection that can learn category specific detectors, using Boosting, from a visual alphabet of shape and appearance. The alphabet itself is learnt incrementally during this process. The resulting representation consists of a set of category-specific descriptors--basic shape features are represented by boundary-fragments, and appearance is represented by patches--where each descriptor in combination with centroid vectors for possible object centroids (geometry) forms an alphabet entry. Our experimental results highlight several qualities of this novel representation. First, we demonstrate the power of purely shape-based representation with excellent categorization and detection results using a Boundary-Fragment-Model (BFM), and investigate the capabilities of such a model to handle changes in scale and viewpoint, as well as intra- and inter-class variability. Second, we show that incremental learning of a BFM for many categories leads to a sub-linear growth of visual alphabet entries by sharing of shape features, while this generalization over categories at the same time often improves categorization performance (over independently learning the categories). Finally, the combination of basic shape and appearance (boundary-fragments and patches) features can further improve results. Certain feature types are preferred by certain categories, and for some categories we achieve the lowest error rates that have been reported so far.

Proceedings ArticleDOI
16 Dec 2008
TL;DR: This work focuses on grouping images containing the same object, despite significant changes in scale, viewpoint and partial occlusions, in very large image collections automatically gathered from Flickr; this is the largest dataset to which image-based data mining has been applied.
Abstract: Automatic organization of large, unordered image collections is an extremely challenging problem with many potential applications. Often, what is required is that images taken in the same place, of the same thing, or of the same person be conceptually grouped together. This work focuses on grouping images containing the same object, despite significant changes in scale, viewpoint and partial occlusions, in very large (1M+) image collections automatically gathered from Flickr. The scale of the data and the extreme variation in imaging conditions makes the problem very challenging. We describe a scalable method that first computes a matching graph over all the images. Image groups can then be mined from this graph using standard clustering techniques. The novelty we bring is that both the matching graph and the clustering methods are able to use the spatial consistency between the images arising from the common object (if there is one). We demonstrate our methods on a publicly available dataset of 5K images of Oxford, a 37K image dataset containing images of the Statue of Liberty, and a much larger 1M image dataset of Rome. This is, to our knowledge, the largest dataset to which image-based data mining has been applied.
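
The mining step described above reduces, in its simplest form, to component analysis on the matching graph. A sketch under that simplification: edges are image pairs that passed spatial verification (inlier counts as weights, all values made up), and groups are read off as connected components via SciPy; the paper also considers richer clustering on the same graph.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy matching graph: an edge (i, j) means images i and j matched and
# passed spatial verification; weights could be inlier counts.
edges = [(0, 1, 35), (1, 2, 20), (3, 4, 50)]     # image pairs + inliers
n_images = 6
rows, cols, w = zip(*edges)
graph = csr_matrix((w, (rows, cols)), shape=(n_images, n_images))

# Mining groups = finding connected components of the matching graph.
n_groups, labels = connected_components(graph, directed=False)
print("groups:", [np.where(labels == g)[0].tolist() for g in range(n_groups)])
```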

Journal ArticleDOI
14 Mar 2008
TL;DR: An approach to generalize the concept of text-based search to nontextual information and describes the possibilities of retrieving objects or scenes in a movie with the ease, speed, and accuracy with which Google retrieves web pages containing particular words by specifying the query as an image of the object or scene.
Abstract: We describe an approach to generalize the concept of text-based search to nontextual information. In particular, we elaborate on the possibilities of retrieving objects or scenes in a movie with the ease, speed, and accuracy with which Google retrieves web pages containing particular words, by specifying the query as an image of the object or scene. In our approach, each frame of the video is represented by a set of viewpoint invariant region descriptors. These descriptors enable recognition to proceed successfully despite changes in viewpoint, illumination, and partial occlusion. Vector quantizing these region descriptors provides a visual analogy of a word, which we term a “visual word.” Efficient retrieval is then achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings. The final ranking also depends on the spatial layout of the regions. Object retrieval results are reported on the full length feature films “Groundhog Day,” “Charade,” and “Pretty Woman,” including searches from within the movie and also searches specified by external images downloaded from the Internet. We discuss three research directions for the presented video retrieval approach and review some recent work addressing them: 1) building visual vocabularies for very large-scale retrieval; 2) retrieval of 3-D objects; and 3) more thorough verification and ranking using the spatial structure of objects.
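
The retrieval core described above (visual words, an inverted file, tf-idf weighting) can be sketched compactly. Everything below is a toy stand-in: frames are bags of made-up visual-word ids, and the spatial re-ranking stage is only noted in a comment.

```python
import math
from collections import defaultdict, Counter

# Toy corpus: each "frame" is a bag of visual-word ids.
frames = {0: [3, 3, 7, 9], 1: [3, 5, 5], 2: [7, 9, 9, 11]}

# Inverted file: visual word -> postings of (frame, term frequency).
index = defaultdict(list)
for f, words in frames.items():
    for w, tf in Counter(words).items():
        index[w].append((f, tf))

N = len(frames)
idf = {w: math.log(N / len(post)) for w, post in index.items()}

def search(query_words):
    """Score frames by tf-idf overlap with the query; a real system
    would re-rank the shortlist by spatial consistency of the regions."""
    scores = defaultdict(float)
    for w, qtf in Counter(query_words).items():
        for f, tf in index.get(w, []):
            scores[f] += qtf * tf * idf.get(w, 0.0) ** 2
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search([3, 7]))   # frames ranked by visual-word tf-idf score
```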

Proceedings ArticleDOI
01 Jan 2008
TL;DR: The Geometric Latent Dirichlet Allocation model for unsupervised particular object discovery in unordered image collections is introduced, which explicitly represents documents as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way.
Abstract: Automatically organizing collections of images presents serious challenges to the current state-of-the art methods in image data mining. Often, what is required is that images taken in the same place, of the same thing, or of the same person be conceptually grouped together. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised particular object discovery in unordered image collections. This explicitly represents documents as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked. We demonstrate the model on a publicly available dataset of Oxford images, and show examples of spatially consistent groupings.

Proceedings ArticleDOI
01 Dec 2008
TL;DR: The number of training images required can be drastically reduced by synthesizing additional training data using photometric stereo, and the resulting classification performance surpasses state-of-the-art results.
Abstract: The objective of this work is to classify texture from a single image under unknown lighting conditions. The current and successful approach to this task is to treat it as a statistical learning problem and learn a classifier from a set of training images, but this requires a sufficient number and variety of training images. We show that the number of training images required can be drastically reduced (to as few as three) by synthesizing additional training data using photometric stereo. We demonstrate the method on the PhoTex and ALOT texture databases. Despite the limitations of photometric stereo, the resulting classification performance surpasses state-of-the-art results.
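
A minimal sketch of the data-synthesis idea, assuming a Lambertian model: three images under known light directions determine each pixel's albedo-scaled normal by least squares, after which new training images can be rendered under novel lights. The texture, lights, and sizes below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Lambertian texture patch: per-pixel albedo and unit normals.
h, w = 16, 16
albedo = rng.uniform(0.4, 1.0, (h, w))
nx = rng.uniform(-0.3, 0.3, (h, w))
ny = rng.uniform(-0.3, 0.3, (h, w))
nz = rng.uniform(0.8, 1.0, (h, w))
n = np.stack([nx, ny, nz], -1); n /= np.linalg.norm(n, axis=-1, keepdims=True)

def render(light):                       # Lambertian image under one light
    return albedo * np.clip(n @ light, 0, None)

# Three input images under known light directions...
lights = np.array([[0, 0, 1.0], [0.5, 0, 0.9], [0, 0.5, 0.9]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
images = np.stack([render(l) for l in lights], -1)       # shape (h, w, 3)

# ...give albedo-scaled normals g = albedo * n by least squares: I = L @ g.
g = np.linalg.solve(lights, images.reshape(-1, 3).T).T.reshape(h, w, 3)

# Synthesize an extra training image under a novel light direction.
novel = np.array([0.6, -0.4, 0.7]); novel /= np.linalg.norm(novel)
synthetic = np.clip(g @ novel, 0, None)
print("max abs error vs. true render:", np.abs(synthetic - render(novel)).max())
```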


Book
01 Jan 2008
TL;DR: A proceedings table of contents covering topics from image segmentation and computational photography to object recognition, tracking, and stereo matching.
Abstract: Segmentation.- Image Segmentation in the Presence of Shadows and Highlights.- Image Segmentation by Branch-and-Mincut.- What Is a Good Image Segment? A Unified Approach to Segment Extraction.- Computational Photography.- Light-Efficient Photography.- Flexible Depth of Field Photography.- Priors for Large Photo Collections and What They Reveal about Cameras.- Understanding Camera Trade-Offs through a Bayesian Analysis of Light Field Projections.- Poster Session IV.- CenSurE: Center Surround Extremas for Realtime Feature Detection and Matching.- Searching the World's Herbaria: A System for Visual Identification of Plant Species.- A Column-Pivoting Based Strategy for Monomial Ordering in Numerical Gröbner Basis Calculations.- Co-recognition of Image Pairs by Data-Driven Monte Carlo Image Exploration.- Movie/Script: Alignment and Parsing of Video and Text Transcription.- Using 3D Line Segments for Robust and Efficient Change Detection from Multiple Noisy Images.- Action Recognition with a Bio-inspired Feedforward Motion Processing Model: The Richness of Center-Surround Interactions.- Linking Pose and Motion.- Automated Delineation of Dendritic Networks in Noisy Image Stacks.- Calibration from Statistical Properties of the Visual World.- Regular Texture Analysis as Statistical Model Selection.- Higher Dimensional Affine Registration and Vision Applications.- Semantic Concept Classification by Joint Semi-supervised Learning of Feature Subspaces and Support Vector Machines.- Learning from Real Images to Model Lighting Variations for Face Images.- Toward Global Minimum through Combined Local Minima.- Differential Spatial Resection - Pose Estimation Using a Single Local Image Feature.- Riemannian Anisotropic Diffusion for Tensor Valued Images.- FaceTracer: A Search Engine for Large Collections of Images with Faces.- What Does the Sky Tell Us about the Camera?.- Three Dimensional Curvilinear Structure Detection Using Optimally Oriented Flux.- Scene Segmentation for Behaviour Correlation.- Robust Visual Tracking Based on an Effective Appearance Model.- Key Object Driven Multi-category Object Recognition, Localization and Tracking Using Spatio-temporal Context.- A Pose-Invariant Descriptor for Human Detection and Segmentation.- Texture-Consistent Shadow Removal.- Scene Discovery by Matrix Factorization.- Simultaneous Detection and Registration for Ileo-Cecal Valve Detection in 3D CT Colonography.- Constructing Category Hierarchies for Visual Recognition.- Sample Sufficiency and PCA Dimension for Statistical Shape Models.- Locating Facial Features with an Extended Active Shape Model.- Dynamic Integration of Generalized Cues for Person Tracking.- Extracting Moving People from Internet Videos.- Multiple Instance Boost Using Graph Embedding Based Decision Stump for Pedestrian Detection.- Object Detection from Large-Scale 3D Datasets Using Bottom-Up and Top-Down Descriptors.- Making Background Subtraction Robust to Sudden Illumination Changes.- Closed-Form Solution to Non-rigid 3D Surface Registration.- Implementing Decision Trees and Forests on a GPU.- General Imaging Geometry for Central Catadioptric Cameras.- Estimating Radiometric Response Functions from Image Noise Variance.- Solving Image Registration Problems Using Interior Point Methods.- 3D Face Model Fitting for Recognition.- A Multi-scale Vector Spline Method for Estimating the Fluids Motion on Satellite Images.- Continuous Energy Minimization Via Repeated Binary Fusion.- Unified Crowd Segmentation.- Quick Shift and Kernel Methods for Mode Seeking.- A Fast Algorithm for Creating a Compact and Discriminative Visual Codebook.- A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes.- Local Regularization for Multiclass Classification Facing Significant Intraclass Variations.- Saliency Based Opportunistic Search for Object Part Extraction and Labeling.- Stereo Matching: An Outlier Confidence Approach.- Improving Shape Retrieval by Learning Graph Transduction.- Cat Head Detection - How to Effectively Exploit Shape and Texture Features.- Motion Context: A New Representation for Human Action Recognition.- Active Reconstruction.- Temporal Dithering of Illumination for Fast Active Vision.- Compressive Structured Light for Recovering Inhomogeneous Participating Media.- Passive Reflectometry.- Fusion of Feature- and Area-Based Information for Urban Buildings Modeling from Aerial Imagery.

01 Jan 2008
TL;DR: Three research directions for the presented video retrieval approach are discussed: 1) building visual vocabularies for very large-scale retrieval; 2) retrieval of 3-D objects; and 3) more thorough verification and ranking using the spatial structure of objects.
Abstract: We describe an approach to generalize the concept of text-based search to nontextual information. In particular, we elaborate on the possibilities of retrieving objects or scenes in a movie with the ease, speed, and accuracy with which Google [9] retrieves web pages containing particular words, by specifying the query as an image of the object or scene. In our approach, each frame of the video is represented by a set of viewpoint invariant region descriptors. These descriptors enable recognition to proceed successfully despite changes in viewpoint, illumination, and partial occlusion. Vector quantizing these region descriptors provides a visual analogy of a word, which we term a “visual word.” Efficient retrieval is then achieved by employing methods from statistical text retrieval, including inverted file systems, and text and document frequency weightings. The final ranking also depends on the spatial layout of the regions. Object retrieval results are reported on the full length feature films “Groundhog Day,” “Charade,” and “Pretty Woman,” including searches from within the movie and also searches specified by external images downloaded from the Internet. We discuss three research directions for the presented video retrieval approach and review some recent work addressing them: 1) building visual vocabularies for very large-scale retrieval; 2) retrieval of 3-D objects; and 3) more thorough verification and ranking using the spatial structure of objects.

Proceedings Article
01 Jan 2008
TL;DR: The Oxford/IIIT team used two approaches for high-level feature extraction, an SVM classifier with a linear combination of kernels and a random forest classifier, together with a vision-only system for the interactive search task.
Abstract: The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used an SVM classifier using a linear combination of kernels, the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations [3]. The advantage of the random forest classifier is the speed of training and testing. In addition, for the people feature, we took a more targeted approach. We used a real-time face detector and an upper body detector, in both cases running on every frame. Our best performing submission, C_OXVGG_1_1, which used a rank fusion of our random forest and SVM approaches, achieved an mAP of 0.101 and was above the median for all but one feature. In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year's, with the only change being a source of accurate upper body detections.
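
The rank fusion mentioned for the best-performing run can be illustrated generically; the scheme below (summing the ranks each system assigns and re-sorting) is a common baseline, not necessarily the exact fusion used, and the run data is hypothetical.

```python
def fuse(ranking_a, ranking_b):
    """Fuse two ranked lists (e.g. a random-forest run and an SVM run)
    by summing the rank each system assigns to a shot."""
    pos_a = {shot: r for r, shot in enumerate(ranking_a)}
    pos_b = {shot: r for r, shot in enumerate(ranking_b)}
    shots = set(pos_a) | set(pos_b)
    worst = len(shots)  # shots unranked by one system fall to the bottom
    return sorted(shots, key=lambda s: pos_a.get(s, worst) + pos_b.get(s, worst))

svm_run = ["shot12", "shot7", "shot3", "shot9"]       # hypothetical rankings
forest_run = ["shot7", "shot12", "shot9", "shot5"]
print(fuse(svm_run, forest_run))                      # fused ordering
```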

01 Jan 2008
TL;DR: The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks; a vision-only approach was used for both tasks, with no use of the text or audio information.
Abstract: The Oxford/IIIT team participated in the high-level feature extraction and interactive search tasks. A vision-only approach was used for both tasks, with no use of the text or audio information. For the high-level feature extraction task, we used two different approaches, both based on a combination of visual features. One used an SVM classifier using a linear combination of kernels, the other used a random forest classifier. For both methods, we trained all high-level features using publicly available annotations [24]. The advantage of the random forest classifier is the speed of training and testing. In addition, for the people feature, we took a more targeted approach. We used a real-time face detector and an upper body detector, in both cases running on every frame. One run, C_OXVGG_4_4, was submitted using only the random forest approach for all concepts except two people. It performed best on dog and hand, and was above the median for all classes except kitchen and airplane. In the interactive search task, our team came third overall with an mAP of 0.158. The system used was identical to last year's, with the only change being a source of accurate upper body detections.


Proceedings ArticleDOI
07 Jul 2008
TL;DR: This system explores how novel techniques from Computer Vision can be used to search in cases where the subtitle text is uninformative (e.g. actions, particular objects).
Abstract: Fast multimedia retrieval in large video datasets remains an extremely challenging problem. This system explores how novel techniques from Computer Vision can be used to search in cases where the subtitle text is uninformative (e.g. actions, particular objects).