
Showing papers presented at "British Machine Vision Conference in 2010"


Proceedings ArticleDOI
01 Jan 2010
TL;DR: A new annotated database of challenging consumer images is introduced, an order of magnitude larger than currently available datasets, and over 50% relative improvement in pose estimation accuracy over a state-of-the-art method is demonstrated.
Abstract: We investigate the task of 2D articulated human pose estimation in unconstrained still images. This is extremely challenging because of variation in pose, anatomy, clothing, and imaging conditions. Current methods use simple models of body part appearance and plausible configurations due to limitations of available training data and constraints on computational expense. We show that such models severely limit accuracy. Building on the successful pictorial structure model (PSM), we propose richer models of both appearance and pose, using state-of-the-art discriminative classifiers without introducing unacceptable computational expense. We introduce a new annotated database of challenging consumer images, an order of magnitude larger than currently available datasets, and demonstrate over 50% relative improvement in pose estimation accuracy over a state-of-the-art method.
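The PSM inference this builds on is exact dynamic programming over a tree of parts: each part has a unary appearance score per candidate location, pairwise terms score the relative placement of a part and its parent, and messages are passed from the leaves to the root. Below is a minimal, illustrative sketch of that max-product pass on discretised locations; the part tree, score arrays and deformation costs are placeholder inputs, not the richer appearance and pose models proposed in the paper.

```python
import numpy as np

def psm_map_inference(unary, parents, deform):
    """Exact MAP inference in a tree-structured pictorial structure model.

    unary   : dict part -> (L,) appearance score per candidate location
    parents : dict part -> parent part (root maps to None); parents listed before children
    deform  : dict part -> (L_parent, L_part) pairwise score of placing the part
              at column j given its parent at row i
    Returns a dict part -> best location index.
    """
    parts = list(parents.keys())
    msg = {p: unary[p].astype(float).copy() for p in parts}
    argmax_child = {}

    for p in reversed(parts):                       # bottom-up: children before parents
        par = parents[p]
        if par is None:
            continue
        scores = deform[p] + msg[p][None, :]        # (L_parent, L_part)
        argmax_child[p] = scores.argmax(axis=1)     # best child location per parent location
        msg[par] += scores.max(axis=1)

    root = next(p for p in parts if parents[p] is None)
    loc = {root: int(msg[root].argmax())}
    for p in parts:                                 # top-down: read off the argmax pointers
        if parents[p] is not None:
            loc[p] = int(argmax_child[p][loc[parents[p]]])
    return loc
```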

914 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work converts the person re-identification problem from an absolute scoring problem to a relative ranking problem and develops a novel Ensemble RankSVM to overcome the scalability limitation suffered by existing SVM-based ranking methods.
Abstract: Solving the person re-identification problem involves matching observations of individuals across disjoint camera views. The problem becomes particularly hard in a busy public scene as the number of possible matches is very high. This is further compounded by significant appearance changes due to varying lighting conditions, viewing angles and body poses across camera views. To address this problem, existing approaches focus on extracting or learning discriminative features followed by template matching using a distance measure. The novelty of this work is that we reformulate the person re-identification problem as a ranking problem and learn a subspace where the potential true match is given the highest ranking rather than any direct distance measure. By doing so, we convert the person re-identification problem from an absolute scoring problem to a relative ranking problem. We further develop a novel Ensemble RankSVM to overcome the scalability limitation suffered by existing SVM-based ranking methods. This new model significantly reduces memory usage and is therefore much more scalable, whilst maintaining high-level performance. We present extensive experiments to demonstrate the performance gain of the proposed ranking approach over existing template matching and classification models.
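The ranking reformulation can be illustrated with a plain reduction to binary classification on pairwise difference vectors: for every probe, the representation of its true match should score higher than that of any wrong match. The sketch below follows that generic RankSVM reduction with scikit-learn rather than the Ensemble RankSVM proposed in the paper; the feature arrays and identity labels are assumed inputs.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_pairwise_ranker(probes, gallery, probe_ids, gallery_ids, n_neg=10, seed=0):
    """Learn a weight vector w so that, for a probe x_p, the absolute-difference
    vector |x_p - x_true| is ranked above |x_p - x_wrong| (higher w-score = better match).

    probes, gallery        : (N, D) and (M, D) appearance feature arrays
    probe_ids, gallery_ids : integer identity labels (numpy arrays)
    """
    rng = np.random.default_rng(seed)
    X, y = [], []
    for xp, pid in zip(probes, probe_ids):
        pos = gallery[gallery_ids == pid]
        neg = gallery[gallery_ids != pid]
        if len(pos) == 0 or len(neg) == 0:
            continue
        d_pos = np.abs(xp - pos[0])                                  # true-match pair vector
        idx = rng.choice(len(neg), size=min(n_neg, len(neg)), replace=False)
        for xg in neg[idx]:
            d_neg = np.abs(xp - xg)
            X.append(d_pos - d_neg); y.append(+1)                    # true match should rank higher
            X.append(d_neg - d_pos); y.append(-1)
    w = LinearSVC(fit_intercept=False, C=1.0).fit(np.array(X), np.array(y)).coef_.ravel()
    return w   # rank a gallery for a new probe by sorting w @ np.abs(x_p - x_g) in descending order
```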

736 citations


Proceedings ArticleDOI
01 Sep 2010
TL;DR: A technique to avoid constructing such a finely sampled image pyramid without sacrificing performance is proposed, and for a broad family of features, including gradient histograms, the feature responses computed at a single scale can be used to approximate feature responses at nearby scales.
Abstract: We demonstrate a multiscale pedestrian detector operating in near real time (6 fps on 640x480 images) with state-of-the-art detection performance. The computational bottleneck of many modern detectors is the construction of an image pyramid, typically sampled at 8-16 scales per octave, and associated feature computations at each scale. We propose a technique to avoid constructing such a finely sampled image pyramid without sacrificing performance: our key insight is that for a broad family of features, including gradient histograms, the feature responses computed at a single scale can be used to approximate feature responses at nearby scales. The approximation is accurate within an entire scale octave. This allows us to decouple the sampling of the image pyramid from the sampling of detection scales. Overall, our approximation yields a speedup of 10-100 times over competing methods with only a minor loss in detection accuracy of about 1-2% on the Caltech Pedestrian dataset across a wide range of evaluation settings. The results are confirmed on three additional datasets (INRIA, ETH, and TUD-Brussels) where our method always scores within a few percent of the state-of-the-art while being 1-2 orders of magnitude faster. The approach is general and should be widely applicable.
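The key approximation can be written in a few lines: compute a channel (e.g. gradient magnitude or a gradient-histogram bin) once, then predict its values at a nearby scale by resampling the channel and applying a power-law correction, instead of recomputing it on a rescaled image. The exponent is channel-specific and fitted empirically in the paper; the value below is only a placeholder.

```python
from scipy.ndimage import zoom

def approx_channel_at_scale(channel, s, lam=1.0):
    """Approximate a feature channel at relative scale s from one computed scale.

    Rather than rebuilding the image pyramid and recomputing features at every
    scale, resample the channel itself and correct its magnitude with a power law
    f(s) ~ f(1) * s**(-lam). `lam` is channel-specific and normally fitted
    empirically (its value and sign convention depend on how s is defined); the
    default here is purely illustrative.
    """
    return zoom(channel, s, order=1) * (s ** -lam)   # bilinear resample + power-law correction
```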

680 citations


Proceedings ArticleDOI
31 Aug 2010
TL;DR: The role of background scene context is investigated and it is demonstrated that improved action recognition performance can be achieved by combining the statistical and part-based representations, and integrating person-centric description with the background scene context.
Abstract: Recognition of human actions is usually addressed in the scope of video interpretation. Meanwhile, common human actions such as ''reading a book'', ''playing a guitar'' or ''writing notes'' also provide a natural description for many still images. In addition, some actions in video such as ''taking a photograph'' are static by their nature and may require recognition methods based on static cues only. Motivated by the potential impact of recognizing actions in still images and the little attention this problem has received in computer vision so far, we address recognition of human actions in consumer photographs. We construct a new dataset available at http://www.di.ens.fr/willow/research/stillactions/ with seven classes of actions in 911 Flickr images representing natural variations of human actions in terms of camera view-point, human pose, clothing, occlusions and scene background. We study action recognition in still images using the state-of-the-art bag-of-features methods as well as their combination with the part-based Latent SVM approach of Felzenszwalb et al. In particular, we investigate the role of background scene context and demonstrate that improved action recognition performance can be achieved by (i) combining the statistical and part-based representations, and (ii) integrating person-centric description with the background scene context. We show results on our newly collected dataset of seven common actions as well as demonstrate improved performance over existing methods on the datasets of Gupta et al. and Yao and Fei-Fei.

295 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work proposes a novel energy formulation which incorporates both segmentation and motion estimation in a single framework, and utilizes state-of-the-art methods to efficiently optimize over a large number of discrete labels.
Abstract: We present a novel off-line algorithm for target segmentation and tracking in video. In our approach, video data is represented by a multi-label Markov Random Field model, and segmentation is accomplished by finding the minimum energy label assignment. We propose a novel energy formulation which incorporates both segmentation and motion estimation in a single framework. Our energy functions enforce motion coherence both within and across frames. We utilize state-of-the-art methods to efficiently optimize over a large number of discrete labels. In addition, we introduce a new ground-truth dataset, called SegTrack, for the evaluation of segmentation accuracy in video tracking. We compare our method with two recent on-line tracking algorithms and provide quantitative and qualitative performance comparisons.
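The energy described here has the usual multi-label MRF structure: a unary data term for each pixel-label assignment plus pairwise terms on edges within a frame and across consecutive frames that enforce motion coherence. The sketch below only illustrates how such an energy is assembled and evaluated; the concrete terms and the optimiser used in the paper are not reproduced.

```python
import numpy as np

def mrf_energy(labels, unary, edges_within, edges_across, pairwise):
    """Evaluate E(L) = sum_p U(p, L_p) + sum_{(p,q)} V(L_p, L_q).

    labels       : (P,) integer label per pixel (pixels indexed over all frames)
    unary        : (P, K) data costs for K labels
    edges_within : iterable of (p, q) pixel-index pairs inside a frame
    edges_across : iterable of (p, q) pixel-index pairs linking consecutive frames
    pairwise     : (K, K) label-compatibility costs (encoding motion coherence)
    """
    energy = unary[np.arange(len(labels)), labels].sum()
    for p, q in list(edges_within) + list(edges_across):
        energy += pairwise[labels[p], labels[q]]
    return float(energy)
```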

217 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper goes back to the ideas from the early days of computer vision, by using 3D object models as the only source of information for building a multi-view object class detector and uses these models for learning 2D shape that can be robustly matched to 2D natural images.
Abstract: Recognizing 3D objects from arbitrary view points is one of the most fundamental problems in computer vision. A major challenge lies in the transition between the 3D geometry of objects and 2D representations that can be robustly matched to natural images. Most approaches thus rely on 2D natural images either as the sole source of training data for building an implicit 3D representation, or by enriching 3D models with natural image features. In this paper, we go back to the ideas from the early days of computer vision, by using 3D object models as the only source of information for building a multi-view object class detector. In particular, we use these models for learning 2D shape that can be robustly matched to 2D natural images. Our experiments confirm the validity of our approach, which outperforms current state-of-the-art techniques on a multi-view detection data set.

193 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: An efficient version of SBA for systems where the secondary structure (relations among cameras) is also sparse, which outperforms the current SBA standard implementation on datasets with sparse secondary structure by at least an order of magnitude, while also being more efficient on dense datasets.
Abstract: Sparse Bundle Adjustment (SBA) is a method for simultaneously optimizing a set of camera poses and visible points. It exploits the sparse primary structure of the problem, where connections exist just between points and cameras. In this paper, we implement an efficient version of SBA for systems where the secondary structure (relations among cameras) is also sparse. The method, which we call Sparse SBA (sSBA), integrates an efficient method for setting up the linear subproblem with recent advances in direct sparse Cholesky solvers. sSBA outperforms the current SBA standard implementation on datasets with sparse secondary structure by at least an order of magnitude, while also being more efficient on dense datasets.
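The efficiency argument rests on the block structure of the normal equations: the point-point block is block-diagonal and trivially invertible, so the Schur complement reduces the system to the cameras, and that reduced matrix inherits the (often sparse) secondary structure that sSBA exploits with a direct sparse Cholesky solver. The sketch below assembles and solves one Gauss-Newton step with SciPy, using a sparse LU factorisation as a stand-in for the sparse Cholesky solver; the block matrices are assumed to be already assembled.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def solve_ba_step(Hcc, Hcp, Hpp_blocks, b_c, b_p):
    """Solve one bundle-adjustment Gauss-Newton step via the Schur complement.

    Hcc        : sparse (6C, 6C) camera-camera Hessian block
    Hcp        : sparse (6C, 3P) camera-point block
    Hpp_blocks : list of P dense (3, 3) point blocks (block-diagonal part)
    b_c, b_p   : right-hand sides (e.g. -J^T r) for cameras and points
    """
    Hpp_inv = sp.block_diag([np.linalg.inv(B) for B in Hpp_blocks], format="csc")
    S = (Hcc - Hcp @ Hpp_inv @ Hcp.T).tocsc()        # reduced camera system (secondary structure)
    rhs = b_c - Hcp @ (Hpp_inv @ b_p)
    delta_c = splu(S).solve(rhs)                     # direct sparse factorisation of S
    delta_p = Hpp_inv @ (b_p - Hcp.T @ delta_c)      # back-substitute the point updates
    return delta_c, delta_p
```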

181 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: A per-person descriptor that uses attention (head orientation) and the local spatial and temporal context in a neighbourhood of each detected person and a structured SVM that combines head orientation and the relative location of people in a frame to improve upon the initial classification obtained with the descriptor.
Abstract: In this paper we address the problem of recognising interactions between two people in realistic scenarios for video retrieval purposes. We develop a per-person descriptor that uses attention (head orientation) and the local spatial and temporal context in a neighbourhood of each detected person. Using head orientation mitigates camera view ambiguities, while the local context, comprised of histograms of gradients and motion, aims to capture cues such as hand and arm movement. We also employ structured learning to capture spatial relationships between interacting individuals. We train an initial set of one-vs-the-rest linear SVM classifiers, one for each interaction, using this descriptor. Noting that people generally face each other while interacting, we learn a structured SVM that combines head orientation and the relative location of people in a frame to improve upon the initial classification obtained with our descriptor. To test the efficacy of our method, we have created a new dataset of realistic human interactions comprised of clips extracted from TV shows, which represents a very difficult challenge. Our experiments show that using structured learning improves the retrieval results compared to using the interaction classifiers independently.

180 citations


Proceedings ArticleDOI
03 Sep 2010
TL;DR: This method is able to recognise actions continuously in real time while achieving accuracy comparable to the state of the art, and introduces the pyramidal spatiotemporal relationship match (PSRM) to encapsulate both local appearance and structural information efficiently.
Abstract: This paper presents a novel real-time action recogniser utilising both local appearance and structural information. Our method is able to recognise actions continuously in real time while achieving accuracy comparable to the state of the art. Run-time speed is of vital importance in real-world action recognition systems, but existing methods seldom take computational complexity into full consideration: a class label is assigned only after an entire query video is analysed, or a large lookahead is required to recognise an action. In addition, the “bag of words” (BOW) model has proven effective for action recognition [5]. However, the standard BOW model ignores the spatiotemporal relationships among feature descriptors, which are useful for describing actions. Addressing these challenges, we present a novel approach for action recognition. The major contributions include the following:

Efficient Spatiotemporal Codebook Learning: We extend the use of semantic texton forests [6] (STFs) from 2D image segmentation to spatiotemporal analysis. As well as being much faster than a traditional flat codebook such as k-means clustering, STFs achieve accuracy comparable to that of existing approaches. STFs are ensembles of random decision trees that textonise input video patches into semantic textons. Since only a small number of simple features are used to traverse the trees, STFs are extremely fast to evaluate. They also serve as a powerful discriminative codebook through multiple decision trees. Figure 1 illustrates how visual codewords are generated using STFs in the proposed method.

Combined Structural and Appearance Information: We propose a richer description of features, so that actions can be classified from very short video sequences. Based on [3], we introduce the pyramidal spatiotemporal relationship match (PSRM) to encapsulate both local appearance and structural information efficiently. Subsequences are sampled from an input video in short intervals (e.g. ≤ 10 frames). After spatiotemporal interest points are localised, the trained STFs assign visual codewords to the features. A set of pairwise spatiotemporal associations is designed to capture the structural relationships among features (i.e. pairwise distances along space-time axes). All possible pairs in the bag of features are analysed by the association rules and stored in a 3D histogram. PSRM leverages the properties of semantic trees and pyramidal match kernels. Multiple pyramidal histograms are then combined to classify a query video. Figure 2 illustrates how the relationship histograms are constructed and matched using PSRM. For each tree in the STFs, a three-dimensional histogram is constructed according to the spatiotemporal structure (see Figure 2, left). Its hierarchical structure offers a time-efficient way to perform the pyramid match kernel [1] for codeword matching (Figure 2, right).

Enhanced Efficiency and Combined Classification: Several techniques are employed to improve recognition speed and accuracy. A novel spatiotemporal interest point detector, called V-FAST, is designed based on the FAST 2D corners [2]. The recognition accuracy is enhanced by adaptively combining PSRM and the bag of semantic texton (BOST) method [6]: the k-means forest classifier is learned using PSRM as a matching kernel.
(Figure overview: feature extraction and feature matching stages — visual codewords from the Semantic Texton Forest, a spatiotemporal relationship match of the codewords, and the pyramid match kernel utilised to match the histograms.)

152 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work looks at the application of the recent extension of the seminal SIFT approach to the 3D volumetric recognition of rigid objects within a complex multi-object volumetric environment including significant noise artefacts.
Abstract: The automatic detection of objects within complex volumetric imagery is becoming of increased interest due to the use of dual energy Computed Tomography (CT) scanners as an aviation security deterrent. These devices produce a volumetric image akin to that encountered in prior medical CT work but in this case we are dealing with a complex multi-object volumetric environment including significant noise artefacts. In this work we look at the application of the recent extension to the seminal SIFT approach to the 3D volumetric recognition of rigid objects within this complex volumetric environment. A detailed overview of the approach and results when applied to a set of exemplar CT volumetric imagery is presented.

143 citations


Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work uses a Conditional Random Field (CRF) model to discover and exploit contextual information, classifying planar patches extracted from the point cloud data and finds that using certain contextual information along with local features leads to better classification results.
Abstract: Semantic 3D models of buildings encode the geometry as well as the identity of key components of a facility, such as walls, floors, and ceilings. Manually constructing such a model is a time-consuming and error-prone process. Our goal is to automate this process using 3D point data from a laser scanner. Our hypothesis is that contextual information is important to reliable performance in unmodified environments, which are often highly cluttered. We use a Conditional Random Field (CRF) model to discover and exploit contextual information, classifying planar patches extracted from the point cloud data. We compare the results of our context-based CRF algorithm with a context-free method based on L2 norm regularized Logistic Regression (RLR). We find that using certain contextual information along with local features leads to better classification results.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work decomposes video into region classes and augments local features with corresponding region-class labels, and demonstrates how this information can be integrated with BoF representations in a kernel-combination framework.
Abstract: Local space-time features have recently shown promising results within Bag-of-Features (BoF) approach to action recognition in video. Pure local features and descriptors, however, provide only limited discriminative power implying ambiguity among features and sub-optimal classification performance. In this work, we propose to disambiguate local space-time features and to improve action recognition by integrating additional nonlocal cues with BoF representation. For this purpose, we decompose video into region classes and augment local features with corresponding region-class labels. In particular, we investigate unsupervised and supervised video segmentation using (i) motion-based foreground segmentation, (ii) person detection, (iii) static action detection and (iv) object detection. While such segmentation methods might be imperfect, they provide complementary region-level information to local features. We demonstrate how this information can be integrated with BoF representations in a kernel-combination framework. We evaluate our method on the recent and challenging Hollywood-2 action dataset and demonstrate significant improvements.
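A common way to realise the kernel-combination step is to build one bag-of-features histogram per channel (plain local features, features restricted to detected person regions, and so on), compute a χ² kernel per channel, and feed their average to an SVM with a precomputed kernel. The sketch below follows that common recipe with scikit-learn; the specific channels, kernel weights and parameters used in the paper may differ.

```python
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def combined_kernel(channels_a, channels_b, gamma=1.0):
    """Average of per-channel exponentiated-chi2 kernels over BoF histograms.

    channels_a, channels_b : lists with one (N, D_k) / (M, D_k) non-negative
    histogram matrix per channel. Per-channel bandwidths or learned channel
    weights (multiple kernel learning) are natural extensions of this sketch.
    """
    kernels = [chi2_kernel(a, b, gamma=gamma) for a, b in zip(channels_a, channels_b)]
    return sum(kernels) / len(kernels)

# Usage with a precomputed-kernel SVM (train_channels/test_channels are per-channel
# histogram matrices, y_train the action labels):
# K_train = combined_kernel(train_channels, train_channels)
# K_test  = combined_kernel(test_channels,  train_channels)
# clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y_train)
# y_pred = clf.predict(K_test)
```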

Proceedings ArticleDOI
01 Jan 2010
TL;DR: A novel approach to cross-view gait recognition with the view angle of a probe gait sequence unknown is developed, which can cope with feature mismatch across views and is more robust against feature noise.
Abstract: Among various factors that can affect the performance of gait recognition, changes in viewpoint pose the biggest problem. In this work, we develop a novel approach to cross-view gait recognition with the view angle of a probe gait sequence unknown. We formulate a Gaussian Process (GP) classification framework to estimate the view angle of each probe gait sequence. To measure the similarity of gait sequences captured at different view angles, we model the correlation of gait sequences from different views using Canonical Correlation Analysis (CCA) and use the correlation strength as the similarity measure. This differs significantly from existing approaches, which reconstruct gait features in different views either through 2D view transformation or 3D calibration. Without explicit reconstruction, our approach can cope with feature mismatch across views and is more robust against feature noise. Our experiments validate that the proposed method significantly outperforms the existing state-of-the-art methods.
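The cross-view matching step can be sketched directly with scikit-learn's CCA: fit the canonical projections on paired training sequences from two views, project probe and gallery features into the shared subspace, and use correlation strength there as the similarity. The Gaussian Process view-angle estimator is assumed to have already selected which view pair to use, and the exact feature representation and correlation measure in the paper may differ.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_gait_similarity(train_a, train_b, probe_a, gallery_b, n_components=8):
    """Cross-view similarity as correlation strength in a CCA subspace.

    train_a, train_b : paired (N, Da) / (N, Db) gait features of the same subjects
                       observed from view A and view B (used to fit the projections)
    probe_a          : (Da,) probe feature from view A
    gallery_b        : (M, Db) gallery features from view B
    Returns one correlation score per gallery entry (higher = more similar).
    """
    cca = CCA(n_components=n_components).fit(train_a, train_b)
    p_scores, g_scores = cca.transform(probe_a[None, :], gallery_b)
    p = p_scores[0] - p_scores[0].mean()
    sims = []
    for g in g_scores:
        g = g - g.mean()
        sims.append(float(p @ g / (np.linalg.norm(p) * np.linalg.norm(g) + 1e-12)))
    return np.array(sims)
```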

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper completes the construction and combines the two techniques to obtain explicit feature maps for the generalized RBF kernels, and investigates a learning method using ℓ1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension.
Abstract: Kernel methods yield state-of-the-art performance in certain applications such as image classification and object detection. However, large scale problems require machine learning techniques of at most linear complexity and these are usually limited to linear kernels. This unfortunately rules out gold-standard kernels such as the generalized RBF kernels (e.g. exponential-χ²). Recently, Maji and Berg [13] and Vedaldi and Zisserman [20] proposed explicit feature maps to approximate the additive kernels (intersection, χ², etc.) by linear ones, thus enabling the use of fast machine learning techniques in a non-linear context. An analogous technique was proposed by Rahimi and Recht [14] for the translation invariant RBF kernels. In this paper, we complete the construction and combine the two techniques to obtain explicit feature maps for the generalized RBF kernels. Furthermore, we investigate a learning method using ℓ1 regularization to encourage sparsity in the final vector representation, and thus reduce its dimension. We evaluate this technique on the VOC 2007 detection challenge, showing when it can improve on fast additive kernels, and the trade-offs in complexity and accuracy.
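The composition the paper describes is available off the shelf in scikit-learn: an explicit map for the additive χ² kernel followed by random Fourier features for the Gaussian envelope approximates the exponential-χ² kernel, and a linear SVM can then be trained on the resulting explicit features. The sketch below shows that composition; it illustrates the construction rather than the authors' implementation, and the hyperparameters are placeholders.

```python
from sklearn.kernel_approximation import AdditiveChi2Sampler, RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# exp(-gamma * chi2(x, y)) is approximated by composing an explicit additive-chi2
# feature map with random Fourier features (Rahimi & Recht) for the Gaussian part,
# so that a linear learner can stand in for a non-linear kernel machine.
exp_chi2_model = make_pipeline(
    AdditiveChi2Sampler(sample_steps=2),          # explicit map for the additive chi2 kernel
    RBFSampler(gamma=0.5, n_components=2000,      # random Fourier features for the RBF envelope
               random_state=0),
    LinearSVC(C=1.0),                             # linear classifier on the explicit features
)

# X_train / X_test are non-negative histograms (e.g. bag-of-visual-words):
# exp_chi2_model.fit(X_train, y_train)
# y_pred = exp_chi2_model.predict(X_test)
```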


Proceedings ArticleDOI
31 Aug 2010
TL;DR: A family of object detectors is described that provides state-of-the-art error rates on several important datasets including INRIA people and PASCAL VOC'06 and VOC'07; Partial Least Squares dimensionality reduction is included to speed the training of the basic classifier with no loss of accuracy.
Abstract: We describe a family of object detectors that provides state-of-the-art error rates on several important datasets including INRIA people and PASCAL VOC'06 and VOC'07. The method builds on a number of recent advances. It uses the Latent SVM learning framework and a rich visual feature set that incorporates Histogram of Oriented Gradient, Local Binary Pattern and Local Ternary Pattern descriptors. Partial Least Squares dimensionality reduction is included to speed the training of the basic classifier with no loss of accuracy, and to allow a two-stage quadratic classifier that further improves the results. A simple sparsification technique can reduce the size of the feature set by around 70% with little loss of accuracy. We evaluate our methods and compare them to other recent ones on several datasets. Our basic root detectors outperform the single component part-based ones of Felzenszwalb et al. on 9 of 10 classes of VOC'06 (12% increase in Mean Average Precision) and 11 of 20 classes of VOC'07 (7% increase in MAP). On the INRIA Person dataset, they increase the Average Precision by 12% relative to Dalal & Triggs.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: An unsupervised learning procedure based on Kernel Canonical Correlation Analysis is proposed that discovers the relationship between how humans tag images and the relative importance of objects and their layout in the scene.
Abstract: We introduce a method for image retrieval that leverages the implicit information about object importance conveyed by the list of keyword tags a person supplies for an image. We propose an unsupervised learning procedure based on Kernel Canonical Correlation Analysis that discovers the relationship between how humans tag images (e.g., the order in which words are mentioned) and the relative importance of objects and their layout in the scene. Using this discovered connection, we show how to boost accuracy for novel queries, such that the search results may more closely match the user’s mental image of the scene being sought. We evaluate our approach on two datasets, and show clear improvements over both an approach relying on image features alone, as well as a baseline that uses words and image features, but ignores the implied importance cues.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: A novel framework is presented for recognising realistic human actions in unconstrained environments, based on computing a rich set of descriptors from key-point trajectories, together with an adaptive feature fusion method that combines different local motion descriptors to improve model robustness against feature noise and background clutter.
Abstract: Problem: This paper addresses the problem of recognising realistic human actions captured in unconstrained environments (Fig. 1). Existing approaches for action recognition have focused on improving visual feature representation using either spatio-temporal interest points or key-point trajectories. However, these methods are insufficient when action videos are recorded in unconstrained environments because: (1) reliable visual features are hard to extract due to occlusions, illumination change, scale variation and background clutter; (2) the effectiveness of visual features is strongly dependent on the unpredictable characteristics of camera movements; (3) complicated visual actions result in unequal discriminativeness of visual features.

Our Solutions: In this paper, we present a novel framework for recognising realistic human actions in unconstrained environments. The novelties of our work lie in three aspects. First, we propose a new action representation based on computing a rich set of descriptors from key-point trajectories. Second, in order to cope with drastic changes in motion characteristics with and without camera movements, we develop an adaptive feature fusion method to combine different local motion descriptors for improving model robustness against feature noise and background clutter. Finally, we propose a novel Multi-Class Delta Latent Dirichlet Allocation (MC-∆LDA) model for feature selection, in which the most informative features in a high-dimensional feature space are selected collaboratively rather than independently.

Motion Descriptors: We first compute trajectories of key points using the KLT tracker and SIFT matching. After trajectory pruning by identifying the Region of Interest (ROI), we compute three types of motion descriptors from the surviving trajectories. First, the Orientation-Magnitude Descriptor is extracted by quantising the orientation and magnitude of motion between two consecutive points of the same trajectory. Second, the Trajectory Shape Descriptor is extracted by computing Fourier coefficients of a single trajectory. Finally, the Appearance Descriptor is extracted by computing SIFT features at all points of a trajectory.

Interest Point Features: We also detect spatio-temporal interest points, as they contain information complementary to trajectory features. At each interest point, a surrounding 3D cuboid is extracted. We use gradient vectors to describe these cuboids and PCA to reduce the descriptor's dimensionality.

Adaptive Feature Fusion: We wish to adaptively fuse trajectory-based descriptors with 3D interest-point-based descriptors according to the presence of camera movement. The presence of a moving camera is detected by computing the global optical flow over all frames in a clip; if the majority of the frames contain global motion, we regard the clip as being recorded by a moving camera (a sketch of this test follows the abstract). For clips without camera movement, both interest-point and trajectory-based descriptors can be computed reliably and thus both types of descriptors are used for recognition. In contrast, when camera motion is detected, interest-point-based descriptors are less meaningful, so only trajectory descriptors are employed.

Collaborative Feature Selection: We propose the MC-∆LDA model (Fig. 2) for collaboratively selecting dominant features for classification. We consider each video clip x_j to be a mixture of N_t topics Φ = {φ_t}, t = 1, ..., N_t (to be discovered), each of which, φ_t, is a multinomial distribution over N_w words (visual features). The MC-∆LDA model constrains the topic proportions non-uniformly and on a per-clip basis. Each video clip belonging to action category A_c is modelled as a mixture of: (1) N_t^s topics shared by all N_c categories of actions, and (2) N_{t,c} topics uniquely associated with action category A_c. In MC-∆LDA, the non-uniform topic-mixture proportion for a single clip x_j is enforced by its action class label c_j and the hyperparameter α_c for the corresponding action class c. Given the total number of topics N_t = N_t^s + Σ_{c=1}^{N_c} N_{t,c}, the structure of the MC-∆LDA model, and the observable variables (clips x_j and action labels c_j), we can learn the N_t^s shared topics as well as all Σ_{c=1}^{N_c} N_{t,c} unique topics for the N_c classes of actions. We use the N_t^s topics shared by all actions for selecting discriminative features. The N_t^s shared topics are represented as an N_w × N_t^s matrix Φ^s. The feature selection can be summarised in two steps: (1) for each feature v_k, k = …

(Figure 1: Actions captured in unconstrained environments, YouTube dataset. From left to right: cycling, diving, soccer juggling, and walking with a dog.)
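The camera-movement test that drives the adaptive fusion is simple enough to sketch: estimate dense optical flow between consecutive frames, call a frame "globally moving" if the flow magnitude is large over most of the image, and flag the clip as moving-camera if the majority of its frames are. The sketch below uses OpenCV's Farnebäck flow; the thresholds are illustrative placeholders, not the paper's values.

```python
import cv2
import numpy as np

def camera_is_moving(frames, mag_thresh=1.0, frame_ratio=0.5):
    """Heuristic global-motion test used to drive adaptive feature fusion.

    frames : list of grayscale frames (uint8 arrays) from one clip. A frame counts
    as globally moving if its median flow magnitude exceeds mag_thresh pixels;
    the clip is flagged as moving-camera if more than frame_ratio of frames do.
    """
    moving = 0
    for prev, cur in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        if np.median(np.linalg.norm(flow, axis=2)) > mag_thresh:
            moving += 1
    return moving > frame_ratio * (len(frames) - 1)

# Fusion rule from the abstract:
#   moving camera -> use trajectory-based descriptors only
#   static camera -> fuse trajectory and 3D interest-point descriptors
```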

Proceedings ArticleDOI
20 Aug 2010
TL;DR: Comparative evaluation using a common experimental setup on the GAVAB dataset, considered the most expression-rich and noise-prone 3D face dataset, shows that the approach outperforms other state-of-the-art approaches.
Abstract: In this paper we explore the use of shapes of elastic radial curves to model 3D facial deformations, caused by changes in facial expressions. We represent facial surfaces by indexed collections of radial curves on them, emanating from the nose tips, and compare the facial shapes by comparing the shapes of their corresponding curves. Using a past approach on elastic shape analysis of curves, we obtain an algorithm for comparing facial surfaces. We also introduce a quality control module which allows our approach to be robust to pose variation and missing data. Comparative evaluation using a common experimental setup on GAVAB dataset, considered as the most expression-rich and noise-prone 3D face dataset, shows that our approach outperforms other state-of-the-art approaches.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper drops the assumption that the target is known with enough precision and adjusts it in an iterative way as part of the whole process to obtain a very accurate camera calibration that outperforms those obtained with well-known standard techniques.
Abstract: Accurate intrinsic camera calibration is essential to any computer vision task that involves image based measurements. Given its crucial role with respect to precision, a large number of approaches have been proposed over the last decades. Despite this rich literature, steady advancements in imaging hardware regularly push forward the need for even more accurate techniques. Some authors suggest generalizations of the camera model itself, others propose novel designs for calibration targets or different optimization schemes. In this paper we take a completely different route by directly addressing one of the most overlooked problems in practical calibration scenarios. Specifically, we drop the assumption that the target is known with enough precision and we adjust it in an iterative way as part of the whole process. This is in fact the case with the typical target used in most of the calibration literature, which is usually printed on paper and stitched on a flat surface. In the experimental section we show that even with such a cheaply crafted target it is possible to obtain a very accurate camera calibration that outperforms those obtained with well-known standard techniques.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: A simple algorithm is proposed that uses a dataset with manually marked salient objects to learn to detect saliency based on image superpixels, as opposed to individual image pixels; evaluation shows a significant advantage of the approach over previous work.
Abstract: Saliency detection is a well researched problem in computer vision. In previous work, most of the effort is spent on manually devising a saliency measure. Instead we propose a simple algorithm that uses a dataset with manually marked salient objects to learn to detect saliency. Building on the recent success of segmentation-based approaches to object detection, our saliency detection is based on image superpixels, as opposed to individual image pixels. Our features are the standard ones often used in vision, i.e. they are based on color, texture, etc. These simple features, properly normalized, surprisingly have a performance superior to the methods with hand-crafted features specifically designed for saliency detection. We refine the initial segmentation returned by the learned classifier by performing binary graph-cut optimization. This refinement step is performed on pixel level to alleviate any potential inaccuracies due to superpixel tessellation. The initial appearance models are updated in an iterative segmentation framework. To ensure that the classifier results are not completely ignored during later iterations, we incorporate classifier confidences into our graph-cut refinement. Evaluation on the standard datasets shows a significant advantage of our approach over previous work.
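The learning stage is deliberately plain: oversegment each image into superpixels, describe every superpixel with standard colour and position statistics, label it salient if it falls mostly inside the annotated salient object, and train an off-the-shelf classifier; graph-cut refinement then operates at pixel level. The sketch below covers only the superpixel/classifier stage with scikit-image and scikit-learn, and its feature set is a reduced stand-in for the one used in the paper.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2lab
from sklearn.ensemble import RandomForestClassifier

def superpixel_features(image, n_segments=300):
    """Mean Lab colour plus normalised centroid per superpixel."""
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    lab = rgb2lab(image)
    yy, xx = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    feats = []
    for s in range(segments.max() + 1):
        m = segments == s
        feats.append(np.hstack([lab[m].mean(axis=0),
                                [yy[m].mean() / image.shape[0],
                                 xx[m].mean() / image.shape[1]]]))
    return segments, np.array(feats)

# Training over an annotated set (gt_mask is the binary salient-object mask per image);
# a superpixel is labelled salient if most of its pixels fall inside the mask:
# segs, F = superpixel_features(img)
# y = np.array([gt_mask[segs == s].mean() > 0.5 for s in range(len(F))])
# clf = RandomForestClassifier(n_estimators=200).fit(F, y)   # fit over many images in practice
# saliency_map = clf.predict_proba(F)[:, 1][segs]            # per-pixel scores before graph cut
```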

Proceedings ArticleDOI
01 Sep 2010
TL;DR: A method for training additive models that is several times faster than the standard approach without sacrificing accuracy is demonstrated, and it is demonstrated that linear additive models can serve as an effective substitute for linear regression.
Abstract: The Active Appearance Model (AAM) provides an efficient method for localizing objects that vary in both shape and texture, and uses a linear regressor to predict updates to model parameters based on current image residuals. This study investigates using additive (or ‘boosted’) predictors, both linear and non-linear, as a substitute for the linear predictor in order to improve accuracy and efficiency. We demonstrate: (a) a method for training additive models that is several times faster than the standard approach without sacrificing accuracy; (b) that linear additive models can serve as an effective substitute for linear regression; (c) that linear models are as effective as non-linear models when close to the true solution. Based on these observations, we compare a ‘hybrid’ AAM to the standard AAM for both the XM2VTS and BioID datasets, including cross-dataset evaluations.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This paper learns empirically the generalization error of a 3D morphable model using out-of-sample data and incorporates this into a parameter-free probabilistic framework which allows 3D shape recovery of a face in an arbitrary pose in a single step.
Abstract: In this paper, we present a robust and efficient method to statistically recover the full 3D shape and texture of faces from single 2D images. We separate shape and texture recovery into two linear problems. For shape recovery, we learn empirically the generalization error of a 3D morphable model using out-of-sample data. We use this to predict the 2D variance associated with a sparse set of 2D feature points. This knowledge is incorporated into a parameter-free probabilistic framework which allows 3D shape recovery of a face in an arbitrary pose in a single step. Under the assumption of diffuse-only reflectance, we also show how photometric invariants can be used to recover texture parameters in an illumination insensitive manner. We present empirical results with comparison to the state-of-the-art analysis-by-synthesis methods and show an application of our approach to adjusting the pose of subjects in oil paintings.

PatentDOI
30 Jun 2010
TL;DR: In this article, an active learning method is proposed to train a compact classifier for view-based object recognition by searching for local minima of a classifier's output in a low-dimensional space of rendering parameters.
Abstract: An “active learning” method trains a compact classifier for view-based object recognition. The method actively generates its own training data. Specifically, the generation of synthetic training images is controlled within an iterative training process. Valuable and/or informative object views are found in a low-dimensional rendering space and then added iteratively to the training set. In each iteration, new views are generated. A sparse training set is iteratively generated by searching for local minima of a classifier's output in a low-dimensional space of rendering parameters. An initial training set is generated. The classifier is trained using the training set. Local minima are found of the classifier's output in the low-dimensional rendering space. Images are rendered at the local minima. The newly-rendered images are added to the training set. The procedure is repeated so that the classifier is retrained using the modified training set.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: A simple and flexible extrinsic calibration method for a non-overlapping camera rig, aimed at visual navigation in urban environments; its main contributions are a study of the singular motions and a specific bundle adjustment that both reconstructs the scene and calibrates the cameras.
Abstract: Multi-camera systems are more and more used in vision-based robotics. An accurate extrinsic calibration is usually required. In most cases, this task is done by matching features through different views of the same scene. However, if the cameras' fields of view do not overlap, such a matching procedure is not feasible anymore. This article deals with a simple and flexible extrinsic calibration method for a non-overlapping camera rig. The aim is the calibration of non-overlapping cameras embedded on a vehicle, for visual navigation purposes in urban environments. The cameras do not see the same area at the same time. The calibration procedure consists in manoeuvring the vehicle while each camera observes a static scene. The main contributions are a study of the singular motions and a specific bundle adjustment which both reconstructs the scene and calibrates the cameras. Solutions to handle the singular configurations, such as planar motions, are exposed. The proposed approach has been validated with synthetic and real data.

Proceedings ArticleDOI
03 Sep 2010
TL;DR: A novel directed graphical model for label propagation in lengthy and complex video sequences using a hybrid of generative propagation and discriminative classification in a pseudo time-symmetric video model to achieve a conservative labelling of the video.
Abstract: We propose a novel directed graphical model for label propagation in lengthy and complex video sequences. Given hand-labelled start and end frames of a video sequence, a variational EM based inference strategy propagates either one of several class labels or assigns an unknown class (void) label to each pixel in the video. These labels are used to train a multi-class classifier. The pixel labels estimated by this classifier are injected back into the Bayesian network for another iteration of label inference. The novel aspect of this iterative scheme, as compared to a recent approach [1], is its ability to handle occlusions. This is attributed to a hybrid of generative propagation and discriminative classification in a pseudo time-symmetric video model. The end result is a conservative labelling of the video; large parts of the static scene are labelled into known classes, and a void label is assigned to moving objects and remaining parts of the static scene. These labels can be used as ground truth data to learn the static parts of a scene from videos of it or more generally for semantic video segmentation. We demonstrate the efficacy of the proposed approach using extensive qualitative and quantitative tests over six challenging sequences. We bring out the advantages and drawbacks of our approach, both to encourage its repeatability and motivate future research directions.

Proceedings ArticleDOI
31 Aug 2010
TL;DR: The overall method combines the distinctiveness of multiple local, activity-specific motion models into a global model capable of recognising and tracking multiple activities from simple observations.
Abstract: Recent technological advances have led to the development of cameras that measure depth by means of the time-of-flight (ToF) principle [5]. ToF cameras allow capturing an entire scene instantaneously, and thus provide depth images in real-time. Despite the relatively low resolution, this type of data offers a clear advantage over conventional cameras for specific applications, such as human-machine interaction. In this paper, we propose a method that allows simultaneously recognizing the performed activity and tracking the full-body pose of a person observed by a single ToF camera. Our method removes the need for identifying body parts in sparse and noisy ToF images [4] or for fitting a skeleton using expensive optimisation techniques [1]. The proposed method consists of learning a prior model of human motion and using an efficient, sampling-based inference approach for activity recognition and body tracking (Figure 1). The prior motion model is comprised of a set of low-dimensional manifold embeddings for each activity of interest. We generate the embeddings from full-body pose training data using a manifold learning technique [2]. Each of the embeddings acts as a low-dimensional parametrisation of feasible body poses [3] that we use to constrain the problem of body tracking only from depth cues. In a generative tracking framework, we sample the low-dimensional manifold embedding space by means of a particle filter and thus avoid exhaustively searching the full-body pose space. This way, we are able to track multiple pose hypotheses for different activities and to select one that is most consistent with the observed depth cues. Our depth feature descriptor, intuitively a sparse 3D human silhouette representation, can easily be extracted from ToF images. The overall method combines the distinctiveness of multiple local, activity-specific motion models into a global model capable of recognising and tracking multiple activities from simple observations.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: The method builds on an existing MRF formulation incorporating a prior shape model and colour distributions for the constituent parts and proposes a novel shape model consisting of a deformable spatial prior probability for the part-label at each pixel.
Abstract: We present a method for segmenting the parts of multiple instances of a known object category exhibiting large variations in projected shape and colour. The method builds on an existing MRF formulation incorporating a prior shape model and colour distributions for the constituent parts. We propose a novel shape model consisting of a deformable spatial prior probability for the part-label at each pixel. We also make a simple extension to the MRF formulation to deal simultaneously with multiple objects within a global optimisation. Finally, we evaluate the method for the task of segmenting individual items of clothing in images depicting groups of people, and demonstrate improved performance against the state of the art for this task.

Proceedings ArticleDOI
31 Aug 2010
TL;DR: A robust geometric stereo-rectification method based on a three-step camera rotation is proposed and mathematically explained; the algorithm has accuracy comparable to the state of the art, but finds the right minimum in cases where other methods fail, namely when the epipolar lines are far from horizontal.
Abstract: Image stereo-rectification is the process by which two images of the same solid scene undergo homographic transforms, so that their corresponding epipolar lines coincide and become parallel to the x-axis of the image. A pair of stereo-rectified images is helpful for dense stereo matching algorithms, since it restricts the search domain for each match to a line parallel to the x-axis. Due to the redundant degrees of freedom, the solution to stereo-rectification is not unique and can actually lead to undesirable distortions or get stuck in a local minimum of the distortion function. In this paper a robust geometric stereo-rectification method based on a three-step camera rotation is proposed and mathematically explained. Unlike other methods, which reduce the distortion by explicitly minimizing an empirical measure, the intuitive geometric camera rotation angle is minimized at each step. For un-calibrated cameras, the method uses an efficient minimization algorithm that optimizes only one natural parameter, the focal length. This is in contrast with all former methods, which optimize between 3 and 6 parameters. Comparative experiments show that the algorithm has accuracy comparable to the state of the art, but finds the right minimum in cases where other methods fail, namely when the epipolar lines are far from horizontal.

Proceedings ArticleDOI
01 Jan 2010
TL;DR: It is demonstrated that an off-line trained class-specific detector can be transformed into an instance- specific detector on-the-fly yielding a higher detection confidence for the target, see Fig. 1.
Abstract: In this work, we demonstrate that an off-line trained class-specific detector can be transformed into an instance-specific detector on-the-fly. To this end, we make use of a codebook-based detector [1] that is trained on an object class. Codebooks model the spatial distribution and appearance of object parts. When matching an image against a codebook, a certain set of codebook entries is activated to cast probabilistic votes for the object. For a given object hypothesis, one can collect the entries that voted for the object. In our case, these entries can be regarded as a signature for the target of interest. Since a change of pose and appearance can lead to an activation of very different codebook entries, we learn the statistics for the target and the background over time, i.e. we learn on-line the probability of each part in the codebook belonging to the target. By taking the target-specific statistics into account for voting, the target can be distinguished from other instances in the background, yielding a higher detection confidence for the target, see Fig. 1. A class-specific codebook as in [1, 2, 3, 4, 5] is trained off-line to identify any instance of the class in any image. It models the probability of the patches belonging to the object class, p(c=1|L), and the local spatial distribution of the patches with respect to the object center, p(x|c=1,L). For detection, patches are sampled from an image and matched against the codebook, i.e. each patch P(y) sampled from image location y ends at a leaf L(y). The probability for an instance of the class centered at the location x is then given by accumulating the votes p(x|c=1,L(y)) · p(c=1|L(y)) over all sampled patches P(y).