
Showing papers presented at "British Machine Vision Conference in 2008"


Proceedings ArticleDOI
01 Sep 2008
TL;DR: This work presents a novel local descriptor for video sequences based on histograms of oriented 3D spatio-temporal gradients, with orientation quantization based on regular polyhedrons, and shows that it outperforms the state-of-the-art on several action datasets.
Abstract: In this work, we present a novel local descriptor for video sequences. The proposed descriptor is based on histograms of oriented 3D spatio-temporal gradients. Our contribution is four-fold. (i) To compute 3D gradients for arbitrary scales, we develop a memory-efficient algorithm based on integral videos. (ii) We propose a generic 3D orientation quantization which is based on regular polyhedrons. (iii) We perform an in-depth evaluation of all descriptor parameters and optimize them for action recognition. (iv) We apply our descriptor to various action datasets (KTH, Weizmann, Hollywood) and show that we outperform the state-of-the-art.
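
The integral-video idea in contribution (i) can be sketched in a few lines of numpy: a gradient volume is cumulatively summed over t, y and x, after which the summed gradient over any spatio-temporal cuboid costs eight lookups. The function names and the polyhedron-projection quantizer below are our illustration of the technique, not the authors' code.

    import numpy as np

    def integral_video(v):
        # v: (T, H, W) array, e.g. one component of the 3D gradient.
        iv = v.cumsum(0).cumsum(1).cumsum(2)
        return np.pad(iv, ((1, 0), (1, 0), (1, 0)))  # zero-pad for indexing

    def box_sum(iv, t0, t1, y0, y1, x0, x1):
        # Sum of v[t0:t1, y0:y1, x0:x1] in O(1) via inclusion-exclusion.
        return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1]
                - iv[t1, y1, x0] + iv[t0, y0, x1] + iv[t0, y1, x0]
                + iv[t1, y0, x0] - iv[t0, y0, x0])

    def quantize_orientation(g, face_normals):
        # Vote for the polyhedron face whose normal is closest to gradient g;
        # face_normals: (K, 3) unit vectors, e.g. of a regular icosahedron.
        p = face_normals @ g
        k = int(np.argmax(p))
        return k, max(float(p[k]), 0.0)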

2,016 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper proposes two novel image similarity measures for fast indexing via locality-sensitive hashing, together with an efficient way of exploiting more sophisticated similarity measures that have proven essential in image and particular-object retrieval.
Abstract: This paper proposes two novel image similarity measures for fast indexing via locality sensitive hashing. The similarity measures are applied and evaluated in the context of near duplicate image detection. The proposed method uses a visual vocabulary of vector quantized local feature descriptors (SIFT) and for retrieval exploits enhanced min-Hash techniques. Standard min-Hash uses an approximate set intersection between document descriptors as a similarity measure. We propose an efficient way of exploiting more sophisticated similarity measures that have proven to be essential in image / particular object retrieval. The proposed similarity measures do not require extra computational effort compared to the original measure. We focus primarily on scalability to very large image and video databases, where fast query processing is necessary. The method requires only a small amount of data to be stored for each image. We demonstrate our method on the TrecVid 2006 data set, which contains approximately 146K key frames, and also on the challenging University of Kentucky image retrieval database.
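
As background to the enhanced min-Hash techniques the abstract builds on, here is a minimal sketch of standard min-Hash over visual-word sets (hash family and names are illustrative): under a random hash, the probability that two images share a per-slot minimum equals their set-overlap (Jaccard) similarity.

    import random

    def minhash_sketch(visual_words, num_hashes=64, seed=0):
        # visual_words: set of visual-word ids for one image.
        rng = random.Random(seed)
        P = 2_147_483_647  # a large prime for the linear hash family
        hashes = [(rng.randrange(1, P), rng.randrange(P))
                  for _ in range(num_hashes)]
        return [min((a * w + b) % P for w in visual_words)
                for a, b in hashes]

    def estimated_similarity(s1, s2):
        # Fraction of colliding slots estimates the Jaccard similarity.
        return sum(x == y for x, y in zip(s1, s2)) / len(s1)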

515 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: It is shown that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously.
Abstract: This work investigates the use of Random Forests for class based pixel-wise segmentation of images. The contribution of this paper is three-fold. First, we show that apparently quite dissimilar classifiers (such as nearest neighbour matching to texton class histograms) can be mapped onto a Random Forest architecture. Second, based on this insight, we show that the performance of such classifiers can be improved by incorporating the spatial context and discriminative learning that arises naturally in the Random Forest framework. Finally, we show that the ability of Random Forests to combine multiple features leads to a further increase in performance when textons, colour, filterbanks, and HOG features are used simultaneously. The benefit of the multi-feature classifier is demonstrated with extensive experimentation on existing labelled image datasets. The method equals or exceeds the state of the art on these datasets.

257 citations


Proceedings ArticleDOI
04 Sep 2008
TL;DR: A unified method for recovering from tracking failure and closing loops in real-time monocular simultaneous localisation and mapping, with a bag-of-words appearance model for ranking potential loop closures and a robust method that uses both structure and image appearance to confirm likely matches.
Abstract: We present a unified method for recovering from tracking failure and closing loops in real time monocular simultaneous localisation and mapping. Within a graph-based map representation, we show that recovery and loop closing both reduce to the creation of a graph edge. We describe and implement a bag-of-words appearance model for ranking potential loop closures, and a robust method for using both structure and image appearance to confirm likely matches. The resulting system closes loops and recovers from failures while mapping thousands of landmarks, all in real time.
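
A common way to realise such a bag-of-words ranking is tf-idf weighted cosine similarity between visual-word histograms; the sketch below is a generic stand-in under that assumption, not the paper's implementation.

    import numpy as np

    def rank_loop_closures(query_hist, keyframe_hists):
        # Rank stored keyframes against a query by tf-idf cosine similarity.
        H = np.asarray(keyframe_hists, dtype=float)   # (n_frames, n_words)
        idf = np.log(len(H) / (1.0 + (H > 0).sum(axis=0)))
        W, q = H * idf, query_hist * idf
        scores = (W @ q) / (np.linalg.norm(W, axis=1)
                            * np.linalg.norm(q) + 1e-12)
        return np.argsort(scores)[::-1]               # best matches first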

180 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: A Cumulative Brightness Transfer Function (CBTF) is proposed for mapping colour between cameras located at different physical sites; it makes better use of the available colour information from a very sparse training set, and a bi-directional mapping approach is developed to obtain a more accurate similarity measure between a pair of candidate objects.
Abstract: The appearance of individuals captured by multiple non-overlapping cameras varies greatly due to pose and illumination changes between camera views. In this paper we address the problem of dealing with illumination changes in order to recover matching of individuals appearing at different camera sites. This task is challenging as accurately mapping colour changes between views requires an exhaustive set of corresponding chromatic brightness values to be collected, which is very difficult in real world scenarios. We propose a Cumulative Brightness Transfer Function (CBTF) for mapping colour between cameras located at different physical sites, which makes better use of the available colour information from a very sparse training set. In addition we develop a bi-directional mapping approach to obtain a more accurate similarity measure between a pair of candidate objects. We evaluate the proposed method using challenging datasets obtained from real world distributed CCTV camera networks. The results demonstrate that our bi-directional CBTF method significantly outperforms existing techniques.
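
The core of a cumulative brightness transfer function is inverse-CDF matching between brightness histograms pooled over the whole training set; a minimal per-channel sketch under that reading (our naming, assuming 8-bit values):

    import numpy as np

    def cbtf(train_a, train_b, levels=256):
        # train_a, train_b: 1-D uint8 brightness samples pooled over all
        # training pairs for one channel of cameras A and B.
        ca = np.bincount(train_a, minlength=levels).cumsum().astype(float)
        cb = np.bincount(train_b, minlength=levels).cumsum().astype(float)
        ca /= ca[-1]
        cb /= cb[-1]
        # Map each brightness in A to the value in B of equal cumulative mass.
        return np.searchsorted(cb, ca).clip(0, levels - 1).astype(np.uint8)

    # f = cbtf(samples_a, samples_b); image_b_like = f[image_a]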

172 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: The method uses dynamic texture descriptors to describe human movements in a spatiotemporal way and works on image data rather than silhouettes, following recent trends in computer vision research.
Abstract: We present a novel approach for human activity recognition. The method uses dynamic texture descriptors to describe human movements in a spatiotemporal way. The same features are also used for human detection, which makes our whole approach computationally simple. Following recent trends in computer vision research, our method works on image data rather than silhouettes. We test our method on a publicly available dataset and compare our results to state-of-the-art methods.

170 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: A novel and fast algorithm to solve the Perspective-n-Point problem using a formulation for general camera models; the globally optimal pose can be estimated in less than 0.15 seconds for 100 points.
Abstract: We present a novel and fast algorithm to solve the Perspective-n-Point problem. The PnP problem, estimating the pose of a calibrated camera from measurements of a known 3D scene, is recast as a minimization problem of the Object Space Cost. Instead of limiting the algorithm to perspective cameras, we use a formulation for general camera models. The minimization problem, together with a quaternion based representation of the rotation, is transformed into a positive semidefinite program (SDP). This transformation is done in O(n) time and leads to an SDP of constant size. The solution of the SDP is a global minimizer of the PnP problem, which can be estimated in less than 0.15 seconds for 100 points.

149 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: A number of symmetry models are developed and an eye-tracking study with human participants viewing photographic images is performed to test the models; the results show that the symmetry models better match the human data than the contrast model.
Abstract: Humans are very sensitive to symmetry in visual patterns. Symmetry is detected and recognized very rapidly. While viewing symmetrical patterns eye fixations are concentrated along the axis of symmetry or the symmetrical center of the patterns. This suggests that symmetry is a highly salient feature. Existing computational models of saliency, however, have mainly focused on contrast as a measure of saliency. These models do not take symmetry into account. In this paper, we discuss local symmetry as a measure of saliency. We developed a number of symmetry models and performed an eye tracking study with human participants viewing photographic images to test the models. The performance of our symmetry models is compared with the contrast saliency model of Itti et al. [1]. The results show that the symmetry models better match the human data than the contrast model. This indicates that symmetry is a salient structural feature for humans, a finding which can be exploited in computer vision.

148 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: A two step segmentation algorithm is presented that first obtains a binary segmentation and then applies matting on the border regions to obtain a smooth alpha channel and a novel matting approach, based on energy minimization, is presented.
Abstract: Interactive object extraction is an important part in any image editing software. We present a two step segmentation algorithm that first obtains a binary segmentation and then applies matting on the border regions to obtain a smooth alpha channel. The proposed segmentation algorithm is based on the minimization of the Geodesic Active Contour energy. A fast Total Variation minimization algorithm is used to find the globally optimal solution. We show how user interaction can be incorporated and outline an efficient way to exploit color information. A novel matting approach, based on energy minimization, is presented. Experimental evaluations are discussed, and the algorithm is compared to state of the art object extraction algorithms. The GPU based binaries are available online.

144 citations


Proceedings ArticleDOI
01 Sep 2008
TL;DR: The goal of this work is to detect hand and arm positions over continuous sign language video sequences of more than one hour in length and it is shown that the method is able to identify the true arm and hand locations.
Abstract: The goal of this work is to detect hand and arm positions over continuous sign language video sequences of more than one hour in length. We cast the problem as inference in a generative model of the image. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) using efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large set of frames where correct configurations can be inferred, and using temporal tracking elsewhere. Results are reported for signing footage with changing background, challenging image conditions, and different signers; and we show that the method is able to identify the true arm and hand locations. The results exceed the state-of-the-art for the length and stability of continuous limb tracking.

137 citations


Proceedings ArticleDOI
01 Jan 2008
TL;DR: An incremental semi-supervised one-class learning procedure in which unlabelled trajectories are combined with occasional examples of normal behaviour labelled by a human operator is found to be effective on two different datasets, indicating that a human operator could potentially train the system to detect anomalous behaviour by providing only occasional interventions.
Abstract: A novel learning framework is proposed for anomalous behaviour detection in a video surveillance scenario, so that a classifier which distinguishes between normal and anomalous behaviour patterns can be incrementally trained with the assistance of a human operator. We consider the behaviour of pedestrians in terms of motion trajectories, and parametrise these trajectories using the control points of approximating cubic spline curves. This paper demonstrates an incremental semi-supervised one-class learning procedure in which unlabelled trajectories are combined with occasional examples of normal behaviour labelled by a human operator. This procedure is found to be effective on two different datasets, indicating that a human operator could potentially train the system to detect anomalous behaviour by providing only occasional interventions (a small percentage of the total number of observations).
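
To make the pipeline concrete, here is a hedged sketch of its two ingredients: a fixed-length descriptor from the control points of an approximating cubic B-spline, and a one-class model over operator-labelled normal examples. The knot placement, n_ctrl, and the OneClassSVM standing in for the paper's learner are our assumptions.

    import numpy as np
    from scipy.interpolate import make_lsq_spline
    from sklearn.svm import OneClassSVM

    def spline_control_points(traj, n_ctrl=8, k=3):
        # traj: (N, 2) pixel positions of one pedestrian trajectory.
        u = np.linspace(0.0, 1.0, len(traj))
        # Clamped knot vector giving exactly n_ctrl control points per axis.
        t = np.r_[[0.0] * (k + 1),
                  np.linspace(0, 1, n_ctrl - k + 1)[1:-1],
                  [1.0] * (k + 1)]
        return np.concatenate(
            [make_lsq_spline(u, traj[:, d], t, k).c for d in (0, 1)])

    # normal_tracks: a hypothetical list of (N_i, 2) trajectories labelled
    # normal by the operator; anomalies then score negative under the model.
    # X = np.stack([spline_control_points(t) for t in normal_tracks])
    # model = OneClassSVM(nu=0.05, gamma="scale").fit(X)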

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper shows that the five-point relative pose problem and the six-point focal length problem can easily be formulated as polynomial eigenvalue problems of degree three and two and solved using standard efficient numerical algorithms.
Abstract: In this paper we provide new fast and simple solutions to two important minimal problems in computer vision, the five-point relative pose problem and the six-point focal length problem. We show that these two problems can easily be formulated as polynomial eigenvalue problems of degree three and two and solved using standard efficient numerical algorithms. Our solutions are somewhat more stable than state-of-the-art solutions by Nister and Stewenius and are in some sense more straightforward and easier to implement since polynomial eigenvalue problems are well studied with many efficient and robust algorithms available. The quality of the solvers is demonstrated in experiments.
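
For reference, a degree-two polynomial eigenvalue problem (A0 + l*A1 + l^2*A2) x = 0 reduces to a generalised eigenproblem of twice the size by companion linearisation, which standard solvers handle directly. This is a generic numpy/scipy sketch of that reduction, not the authors' solver.

    import numpy as np
    from scipy.linalg import eig

    def solve_pep2(A0, A1, A2):
        # Solve (A0 + l*A1 + l^2*A2) x = 0 via companion linearisation:
        # with z = [x; l*x], the pencil A z = l B z below is equivalent.
        n = A0.shape[0]
        I, Z = np.eye(n), np.zeros((n, n))
        A = np.block([[Z, I], [-A0, -A1]])
        B = np.block([[I, Z], [Z, A2]])
        vals, vecs = eig(A, B)
        return vals, vecs[:n]  # x is the top half of each z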

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper addresses the problem of extracting an alpha matte from a single photograph given a user-defined trimap with novel ideas for each color modeling step and shows that its approach considerably improves over state-of-the-art techniques by evaluating it on a large database of 54 images with known high-quality ground truth.
Abstract: This paper addresses the problem of extracting an alpha matte from a single photograph given a user-defined trimap. A crucial part of this task is the color modeling step where for each pixel the optimal alpha value, together with its confidence, is estimated individually. This forms the data term of the objective function. It comprises three steps: (i) collecting a candidate set of potential fore- and background colors; (ii) selecting high confidence samples from the candidate set; (iii) estimating a sparsity prior to remove blurry artifacts. We introduce novel ideas for each of these steps and show that our approach considerably improves over state-of-the-art techniques by evaluating it on a large database of 54 images with known high-quality ground truth.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A higher order scheme is presented for the optimal correction method of Kanatani for triangulation from two views and is compared with the method of Hartley and Sturm; the proposed method is significantly faster.
Abstract: A higher order scheme is presented for the optimal correction method of Kanatani [5] for triangulation from two views and is compared with the method of Hartley and Sturm [3]. It is pointed out that the epipole is a singularity of the Hartley-Sturm method, while the proposed method has no singularity. Numerical simulation confirms that both compute identical solutions at other points. However, the proposed method is significantly faster.

Proceedings ArticleDOI
Caifeng Shan, Tommaso Gritti
01 Jan 2008
TL;DR: This paper proposes to learn discriminative LBP-Histogram (LBPH) bins for the task of facial expression recognition, and experimentally illustrates that it is necessary to consider multiscale LBP for representing faces, and most discriminative information is contained in uniform patterns.
Abstract: Local Binary Patterns (LBP) have been well exploited for facial image analysis recently. In the existing work, the LBP histograms are extracted from local facial regions, and used as a whole for the regional description. However, not all bins in the LBP histogram are necessarily useful for facial representation. In this paper, we propose to learn discriminative LBP-Histogram (LBPH) bins for the task of facial expression recognition. Our experiments illustrate that the selected LBPH bins provide a compact and discriminative facial representation. We experimentally illustrate that it is necessary to consider multiscale LBP for representing faces, and that most discriminative information is contained in uniform patterns. By adopting SVM with the selected multiscale LBPH bins, we obtain the best recognition performance of 93.1% on the Cohn-Kanade database.
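
A multiscale uniform-LBP histogram of the kind the bin selection operates on can be computed with scikit-image; the (P, R) scales and the whole-image region below are illustrative (the paper works with per-region histograms).

    import numpy as np
    from skimage.feature import local_binary_pattern

    def multiscale_lbph(gray, scales=((8, 1), (16, 2), (24, 3))):
        # gray: 2-D face image. 'uniform' LBP yields P + 2 code values.
        feats = []
        for P, R in scales:
            codes = local_binary_pattern(gray, P, R, method="uniform")
            hist, _ = np.histogram(codes, bins=P + 2,
                                   range=(0, P + 2), density=True)
            feats.append(hist)
        return np.concatenate(feats)  # bins to be selected by the learner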

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A semantic scene segmentation model is proposed to decompose a wide-area scene into regions where behaviours share similar characteristics and are represented as classes of video events bearing similar features, in order to infer global behaviour patterns.
Abstract: We present a novel framework for inferring global behaviour patterns through modelling behaviour correlations in a wide-area scene and detecting any anomaly in behaviours occurring both locally and globally. Specifically, we propose a semantic scene segmentation model to decompose a wide-area scene into regions where behaviours share similar characteristics and are represented as classes of video events bearing similar features. To model behavioural correlations globally, we investigate both a probabilistic Latent Semantic Analysis (pLSA) model and a two-stage hierarchical pLSA model for global behaviour inference and anomaly detection. The proposed framework is validated by experiments using complex crowded outdoor scenes.
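
A minimal pLSA fitted by EM over event-count matrices illustrates the kind of model used for global behaviour inference; this is a generic flat sketch (our variable names, not the paper's hierarchical variant), under which a clip whose events have low likelihood under the learnt topics would be flagged as anomalous.

    import numpy as np

    def plsa(N, K, iters=100, seed=0):
        # N: (docs, words) count matrix, e.g. clips x video-event classes.
        rng = np.random.default_rng(seed)
        p_wz = rng.random((K, N.shape[1]))
        p_wz /= p_wz.sum(1, keepdims=True)            # p(word | topic)
        p_zd = rng.random((N.shape[0], K))
        p_zd /= p_zd.sum(1, keepdims=True)            # p(topic | doc)
        for _ in range(iters):
            joint = p_zd[:, :, None] * p_wz[None]     # (D, K, W)
            post = joint / (joint.sum(1, keepdims=True) + 1e-12)  # E-step
            Nz = N[:, None, :] * post                 # expected counts
            p_wz = Nz.sum(0)
            p_wz /= p_wz.sum(1, keepdims=True) + 1e-12            # M-step
            p_zd = Nz.sum(2)
            p_zd /= p_zd.sum(1, keepdims=True) + 1e-12
        return p_zd, p_wz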

Proceedings ArticleDOI
01 Jan 2008
TL;DR: An approach to human action recognition via local feature tracking and robust estimation of background motion, built on a robust feature extraction algorithm combining the KLT tracker and SIFT, together with a method for estimating dominant planes in the scene.
Abstract: This paper discusses an approach to human action recognition via local feature tracking and robust estimation of background motion. The main contribution is a robust feature extraction algorithm based on the KLT tracker and SIFT, as well as a method for estimating dominant planes in the scene. Multiple interest point detectors are used to provide a large number of features for every frame. The motion vectors for the features are estimated using optical flow and SIFT based matching. The features are combined with image segmentation to estimate dominant homographies, and then separated into static and moving ones regardless of the camera motion. The action recognition approach can handle camera motion, zoom, human appearance variations, background clutter and occlusion. The motion compensation shows very good accuracy on a number of test sequences. The recognition system is extensively compared to state-of-the-art action recognition methods and the results are improved.
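
The motion-compensation step can be approximated with standard OpenCV calls: track corners with the KLT tracker, fit the dominant homography with RANSAC, and treat the outliers as candidate moving (foreground) features. This is a simplified single-plane stand-in for the paper's multi-homography estimation.

    import cv2
    import numpy as np

    def dominant_motion(prev_gray, next_gray):
        # Corners in the first frame, tracked into the second (KLT).
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7)
        nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                     pts, None)
        ok = status.ravel() == 1
        src, dst = pts[ok], nxt[ok]
        # Dominant plane as the RANSAC-consensus homography.
        H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        moving = src[inlier_mask.ravel() == 0]  # off the dominant plane
        return H, moving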

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A new general sampling strategy, “quasi-random weighted sampling + trimming” (QWS+), includes well-established strategies as special cases; it minimizes the variance of the hypothesis error estimate and leads to significant improvement in performance compared to standard sampling techniques.
Abstract: This paper addresses the problem of learning from very large databases where batch learning is impractical or even infeasible. Bootstrap is a popular technique applicable in such situations. We show that the sampling strategy used for bootstrapping has a significant impact on the resulting classifier performance. We design a new general sampling strategy, “quasi-random weighted sampling + trimming” (QWS+), that includes well-established strategies as special cases. The QWS+ approach minimizes the variance of the hypothesis error estimate and leads to significant improvement in performance compared to standard sampling techniques. The superior performance is demonstrated on several problems including profile and frontal face detection.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: An algorithm for automatic parameter selection for graph cut segmentation, based on a measure of segmentation quality learnt with AdaBoost, together with a new way to normalize feature weights for the AdaBoost-based classifier that is particularly suitable for the framework.
Abstract: The graph cut based approach has become very popular for interactive segmentation of the object of interest from the background. One of the most important and yet largely unsolved issues in the graph cut segmentation framework is parameter selection. Parameters are usually fixed beforehand by the developer of the algorithm. There is no single setting of parameters, however, that will result in the best possible segmentation for any general image. Usually each image has its own optimal set of parameters. If segmentation of an image is not as desired under the current setting of parameters, the user can always perform more interaction until the desired results are achieved. However, significant interaction may be required if parameter settings are far from optimal. In this paper, we develop an algorithm for automatic parameter selection. We design a measure of segmentation quality based on different features of segmentation that are combined using AdaBoost. Then we run the graph cut segmentation algorithm for different parameter values and choose the segmentation of highest quality according to our learnt measure. We develop a new way to normalize feature weights for the AdaBoost based classifier which is particularly suitable for our framework. Experimental results show a success rate of 95.6% for parameter selection.

Proceedings ArticleDOI
01 Sep 2008
TL;DR: This article proposes a variational multi-view stereo vision method based on meshes for recovering 3D scenes (shape and radiance) from images; it minimizes the reprojection error and proposes an original modification of the Lambertian model to take into account deviations from the constant brightness assumption.
Abstract: This article proposes a variational multi-view stereo vision method based on meshes for recovering 3D scenes (shape and radiance) from images. Our method is based on generative models and minimizes the reprojection error (the difference between the observed images and the images synthesized from the reconstruction). Our contributions are twofold. 1) For the first time, we rigorously compute the gradient of the reprojection error for non-smooth surfaces defined by discrete triangular meshes. The gradient correctly takes into account the visibility changes that occur when a surface moves; this forces the contours generated by the reconstructed surface to perfectly match the apparent contours in the input images. 2) We propose an original modification of the Lambertian model to take into account deviations from the constant brightness assumption without explicitly modelling the reflectance properties of the scene or other photometric phenomena introduced by the camera model. Our method is thus able to recover the shape and the diffuse radiance of non-Lambertian scenes.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: The full method provides interpolated images with a “natural” appearance that do not present the artifacts affecting linear and nonlinear methods.
Abstract: In this paper we describe a novel general purpose image interpolation method based on the combination of two different procedures. First, an adaptive algorithm is applied that locally interpolates pixel values along the direction where the second order image derivative is lower. Then interpolated values are modified using an iterative refinement that minimizes differences in second order image derivatives, maximizes second order derivative values and smooths isolevel curves. The first algorithm itself provides edge-preserving images that are measurably better than those obtained with similarly fast methods presented in the literature. The full method provides interpolated images with a “natural” appearance that do not present the artifacts affecting linear and nonlinear methods. Objective and subjective tests on a wide series of natural images clearly show the advantages of the proposed technique over existing approaches.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This work segments the image via a novel real-time color segmentation algorithm; it subsequently fits planes to textureless segments and refines them using consistency constraints to improve the quality of the stereo algorithm.
Abstract: Several real-time/near real-time stereo algorithms can currently provide accurate 3D reconstructions for well-textured scenes. However, most of these fail in sufficiently large regions that are weakly textured. Conversely, other scene reconstruction algorithms assume strong planarity in the environment. Such approaches can handle lack of texture, but tend to force nonplanar objects onto planes. We propose a compromise approach that prefers stereo depth estimates but can replace estimates in textureless regions with planes in a principled manner at near real-time rates. Our approach segments the image via a novel real-time color segmentation algorithm; we subsequently fit planes to textureless segments and refine them using consistency constraints. To further improve the quality of our stereo algorithm, we optionally employ loopy belief propagation to correct local errors.
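
Fitting a plane to a weakly textured segment typically means a least-squares disparity plane d = a*x + b*y + c over the segment's reliable matches; a minimal sketch under that assumption (the paper's consistency refinement is not shown):

    import numpy as np

    def fit_disparity_plane(xs, ys, ds):
        # xs, ys: pixel coordinates of reliable matches in one segment;
        # ds: their stereo disparities. Returns (a, b, c).
        A = np.column_stack([xs, ys, np.ones(len(xs))])
        (a, b, c), *_ = np.linalg.lstsq(A, ds, rcond=None)
        return a, b, c

    # Each segment pixel then gets the planar disparity d = a*x + b*y + c.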

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This paper presents an approach for human action recognition by finding the discriminative key frames from a video sequence and representing them with the distribution of local motion features and their spatiotemporal arrangements.
Abstract: This paper presents an approach for human action recognition by finding the discriminative key frames from a video sequence and representing them with the distribution of local motion features and their spatiotemporal arrangements. In this approach, the key frames of the video sequence are selected by their discriminative power and represented by the local motion features detected in them and integrated from their temporal neighbors. In the key frame’s representation, the spatial arrangements of the motion features are captured in a hierarchical spatial pyramid structure. By using frame by frame voting for the recognition, experiments have demonstrated improved performances over most of the other known methods on the popular benchmark data sets.

Recognizing human action from image sequences is an appealing yet challenging problem in computer vision with many applications including motion capture, human-computer interaction, environment control, and security surveillance. In this paper, we focus on recognizing the activities of a person in an image sequence from local motion features and their spatiotemporal arrangements. Our approach is motivated by the recent success of the “bag-of-words” model for general object recognition in computer vision [21, 14]. This representation, which is adapted from the text retrieval literature, models the object by the distribution of words from a fixed visual code book, which is usually obtained by vector quantization of local image visual features. However, this method discards the spatial and the temporal relations among these visual features, which could be helpful in object recognition. Addressing this problem, our approach uses a hierarchical representation for the key frames of a given video sequence to integrate information from both the spatial and the temporal domains. We first apply a spatiotemporal feature detector to the video sequence and obtain the local motion features. Then we generate a visual word code book by quantization of the local motion features and assign a word label to each of them. Next we select key frames of the video sequence by their discriminative power. Then, for each key frame, we integrate the visual words from its nearby frames, divide the key frame spatially into finer subdivisions and compute in each cell the histograms of the visual words detected in this key frame and its temporal neighbors. Finally, we concatenate the histograms from all cells and use them as the key frame’s representation.
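
The hierarchical spatial pyramid over visual words amounts to concatenating per-cell word histograms at successively finer grids; a compact sketch (our naming and grid scheme, assuming integer feature positions):

    import numpy as np

    def spatial_pyramid_histogram(xy, labels, shape, n_words, levels=2):
        # xy: (M, 2) feature positions (x, y); labels: (M,) visual-word ids
        # collected from a key frame and its temporal neighbors.
        H, W = shape
        feats = []
        for lvl in range(levels + 1):
            g = 2 ** lvl  # g x g grid at this pyramid level
            cell = (xy[:, 1] * g // H) * g + (xy[:, 0] * g // W)
            for c in range(g * g):
                feats.append(np.bincount(labels[cell == c],
                                         minlength=n_words))
        return np.concatenate(feats)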

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A generalisation of ICP to articulated structures, which preserves all the properties of the original algorithm, is presented; it reduces the residual registration error by a factor of 2.
Abstract: The ICP algorithm has been extensively used in computer vision for registration and tracking purposes. The original formulation of this method is restricted to the use of non-articulated models. A straightforward generalisation to articulated structures is achievable through the joint minimisation of all the structure pose parameters, for example using Levenberg-Marquardt (LM) optimisation. However, in this approach the aligning transformation cannot be estimated in closed form, like in the original ICP, and the approach heavily suffers from local minima. To overcome this limitation, some authors have extended the straightforward generalisation at the cost of giving up some of the properties of ICP. In this paper, we present a generalisation of ICP to articulated structures, which preserves all the properties of the original algorithm. The key idea is to divide the articulated body into parts, which can be aligned rigidly in the way of the original ICP, with additional constraints to keep the articulated structure intact. Experiments show that our method reduces the residual registration error by a factor of 2.
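
The per-part rigid alignment that the method inherits from ICP has the classic closed form (Kabsch/Horn): centre both point sets, take the SVD of their cross-covariance, and read off rotation and translation. A standalone sketch of that step only (the articulation constraints are the paper's contribution and are not shown):

    import numpy as np

    def rigid_align(P, Q):
        # Least-squares R, t with R @ P[i] + t ~ Q[i] for matched 3-D points.
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cp).T @ (Q - cq)
        U, _s, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = cq - R @ cp
        return R, t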

Proceedings ArticleDOI
01 Jan 2008
TL;DR: A simple approach to semantic image segmentation that scores low-level patches according to their class relevance, propagates these posterior probabilities to pixels and uses low-level segmentation to guide the semantic segmentation.
Abstract: We propose a simple approach to semantic image segmentation. Our system scores low-level patches according to their class relevance, propagates these posterior probabilities to pixels and uses low-level segmentation to guide the semantic segmentation. The two main contributions of this paper are as follows. First, for the patch scoring, we describe each patch with a high-level descriptor based on the Fisher kernel and use a set of linear classifiers. While the Fisher kernel methodology was shown to lead to high accuracy for image classification, it has not been applied to the segmentation problem. Second, we use global image classifiers to take into account the context of the objects to be segmented. If an image as a whole is unlikely to contain an object class, then the corresponding class is not considered in the segmentation pipeline. This increases the classification accuracy and reduces the computational cost. We will show that despite its apparent simplicity, this system provides above state-of-the-art performance on the PASCAL VOC 2007 dataset and state-of-the-art performance on the MSRC 21 dataset.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This work considers multi-target tracking via probabilistic data association among tracklets (trajectory fragments), a mid-level representation that provides good spatio-temporal context for efficient tracking.
Abstract: We consider multi-target tracking via probabilistic data association among tracklets (trajectory fragments), a mid-level representation that provides good spatio-temporal context for efficient tracking. Model parameter estimation and the search for the best association among tracklets are unified naturally within a Markov Chain Monte Carlo sampling procedure. The proposed approach is able to infer the optimal model parameters for different tracking scenarios in an unsupervised manner.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: Results from video sequences demonstrate that an improved posterior estimation using learnt colour distributions reduces classification error and provides accurate pose information in images where the head occupies as little as 10 pixels square.
Abstract: This paper presents an algorithm for the classification of head pose in low resolution video. Invariance to skin, hair and background colours is achieved by classifying using an ensemble of randomised ferns which have been trained on labelled images. The ferns are used to simultaneously classify the head pose and to identify the most likely hypothesis for the mapping between colours and labels. Results from video sequences demonstrate that an improved posterior estimation using learnt colour distributions reduces classification error and provides accurate pose information in images where the head occupies as little as 10 pixels square.

In systems which automatically monitor surveillance video, knowledge of head pose provides an important cue for higher level behavioural analysis. The focus of an individual’s attention often indicates their desired destination, whereas mutual attention between people indicates familiarity, and any single object or person receiving attention from a large number of people is likely to be worthy of further investigation. In systems controlling dynamic cameras, a pose estimation from a low resolution head image can be used to determine whether or not a close-up from a dynamic camera would provide a face image that is suitable for identification.

Surveillance cameras tend to have a fairly wide field of view, making the region of the video that is occupied by a person’s head fairly small. The low resolution of the head image prevents the application of techniques which require detail such as those which track feature points or detect facial features [6, 4]. The majority of research into head pose measurement in low resolution video involves the use of labelled training examples which are used to train various types of classifiers such as neural networks [11, 2, 13], support vector machines [14] or nearest neighbour and tree based classifiers [10, 7, 1]. Other approaches model the head as an ellipsoid and either learn a texture from training data [15] or fit a reprojected head image to find a relative rotation [9].

For a head pose classifier to be effective in real-world situations it must be able to cope with different skin and hair colours as well as wide variations in lighting direction, intensity and colour. Most existing classifiers are susceptible to these variations and require examples with different combinations of lighting conditions and skin/hair colour variations in order to make an accurate classification.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: The method exploits a pyramid of sliding windows and quantifies how “crowd-like” each level of the pyramid is using an underlying statistical model based on quantized SIFT features, describing the degree of crowd-like appearance around an image location as the surrounding spatial extent is increased.
Abstract: The analysis of human crowds has widespread uses from law enforcement to urban engineering and traffic management. All of these require a crowd to first be detected, which is the problem addressed in this paper. Given an image, the algorithm we propose segments it into crowd and non-crowd regions. The main idea is to capture two key properties of crowds: (i) on a narrow scale, its basic element should look like a human (only weakly so, due to low resolution, occlusion, clothing variation etc.), while (ii) on a larger scale, a crowd inherently contains repetitive appearance elements. Our method exploits this by building a pyramid of sliding windows and quantifying how “crowd-like” each level of the pyramid is using an underlying statistical model based on quantized SIFT features. The two aforementioned crowd properties are captured by the resulting feature vector of window responses, describing the degree of crowd-like appearance around an image location as the surrounding spatial extent is increased.

Proceedings ArticleDOI
04 Dec 2008
TL;DR: An efficient fusion of contour and texture cues for image categorization and object detection is proposed; the synergy of the two feature types performs significantly better than either alone, and computational efficiency is substantially improved using the feature selection mechanism.
Abstract: This paper proposes an efficient fusion of contour and texture cues for image categorization and object detection. Our work confirms and strengthens recent results that combining complementary feature types improves performance. We obtain a similar improvement in accuracy and additionally an improvement in efficiency. We use a boosting algorithm to learn models that use contour and texture features. Our main contributions are (i) the use of dense generic texture features to complement contour fragments, and (ii) a simple feature selection mechanism that includes the computational costs of features in order to learn a run-time efficient model. Our evaluation on 17 challenging and varied object classes confirms that the synergy of the two feature types performs significantly better than either alone, and that computational efficiency is substantially improved using our feature selection mechanism. An investigation of the boosted features shows a fascinating emergent property: the absence of certain textures often contributes towards object detection. Comparison with recent work shows that performance is state of the art.

Proceedings ArticleDOI
01 Jan 2008
TL;DR: This work presents a method to find the sub-frame accurate time offset between two or more video sequences recorded by unsynchronised, non-stationary cameras, and addresses the problem of identifying the time relation between recorded sequences without the need to invade the scene or to use special cameras.
Abstract: This paper studies the problem of estimating the sub-frame temporal offset between unsynchronised, non-stationary cameras. Based on motion trajectory correspondences, the estimation is done in two steps. First, we propose an algorithm to robustly estimate the frame-accurate offset by analysing the trajectories and matching their characteristic time patterns. Using this result, we then show how the estimation of the fundamental matrix between two cameras can be reformulated to yield the sub-frame accurate offset from nine correspondences. We verify the robustness and performance of our approach on synthetic data as well as on real video sequences.

In this work we present a method to find the sub-frame accurate time offset between two or more video sequences recorded by unsynchronised, non-stationary cameras. We address the problem of identifying the time relation between recorded sequences without the need to invade the scene or to use special cameras. The main application of the method is multi-view video acquisition. Most multi-view processing algorithms, e.g. stereo vision, visual hull estimation and viewpoint interpolation, rely on the assumption that the video sequences are temporally synchronised. Synchronicity can be achieved by hardware synchronisation of the recording cameras. While this is feasible for laboratory or studio situations, it reduces the applicability of these methods in outdoor environments. Computing the sub-frame accurate time offset between unsynchronised non-stationary cameras is necessary to apply multi-video algorithms to a wider range of scenes.

Our method is based on tracking feature points and the resulting trajectories. It is divided into two steps. First, we find the time offset up to per-frame accuracy by extracting salient points of trajectories and matching their time patterns. This is possible even with a single trajectory. Camera viewing angle differences of up to 90 degrees can be handled, as long as the tracked feature points are visible in both sequences. Using this coarse alignment, we can reformulate the estimation of the fundamental matrix to directly find the time offset of the non-stationary cameras to sub-frame accuracy.

The remainder of the paper is organised as follows: the next section summarises previous work; Section 3 formalises the problem; in Section 4 our approach to compute the per-frame accurate time shift is presented; in Section 5 we describe how sub-frame accuracy is achieved; experiments and results on synthetic data with ground truth and real world sequences are presented in Section 6.