Showing papers presented at "British Machine Vision Conference in 2011"


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor is presented.
Abstract: We present a novel solution to the problem of recovering and tracking the 3D position, orientation and full articulation of a human hand from markerless visual observations obtained by a Kinect sensor. We treat this as an optimization problem, seeking the hand model parameters that minimize the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and actual hand observations. This optimization problem is effectively solved using a variant of Particle Swarm Optimization (PSO). The proposed method does not require special markers and/or a complex image acquisition setup. Being model based, it provides continuous solutions to the problem of tracking hand articulations. Extensive experiments with a prototype GPU-based implementation of the proposed method demonstrate that accurate and robust 3D tracking of hand articulations can be achieved in near real-time (15 Hz).

1,009 citations
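The core of the method is a discrepancy objective minimized by PSO over the hand-model parameters. As a minimal sketch of a plain PSO loop (the paper uses a tailored variant; the dimensionality, hyperparameters, function names and dummy objective below are illustrative assumptions):

```python
import numpy as np

def pso_minimize(objective, dim, n_particles=64, n_iters=30,
                 w=0.72, c1=1.49, c2=1.49, bounds=(-1.0, 1.0)):
    """Plain particle swarm optimization over a `dim`-dimensional vector."""
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, dim))  # particle positions
    v = np.zeros_like(x)                               # particle velocities
    pbest = x.copy()                                   # per-particle best
    pbest_f = np.array([objective(p) for p in x])
    g = pbest[np.argmin(pbest_f)]                      # global best
    for _ in range(n_iters):
        r1, r2 = np.random.rand(2, n_particles, dim)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)]
    return g

# In the paper, the objective renders a hypothesized hand pose and scores
# its discrepancy against the observed depth map; a quadratic stands in here.
best_pose = pso_minimize(lambda p: float(np.sum(p ** 2)), dim=27)
```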


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A rigorous evaluation of novel encodings for bag of visual words models by identifying both those aspects of each method which are particularly important to achieve good performance, and those aspects which are less critical, which allows a consistent comparative analysis of these encoding methods.
Abstract: A large number of novel encodings for bag of visual words models have been proposed in the past two years to improve on the standard histogram of quantized local features. Examples include locality-constrained linear encoding [23], improved Fisher encoding [17], super vector encoding [27], and kernel codebook encoding [20]. While several authors have reported very good results on the challenging PASCAL VOC classification data by means of these new techniques, differences in the feature computation and learning algorithms, missing details in the description of the methods, and different tuning of the various components make it impossible to compare these methods directly and hard to reproduce the reported results. This paper addresses these shortcomings by carrying out a rigorous evaluation of these new techniques by: (1) fixing the other elements of the pipeline (features, learning, tuning); (2) disclosing all the implementation details; and (3) identifying both those aspects of each method which are particularly important to achieve good performance, and those aspects which are less critical. This allows a consistent comparative analysis of these encoding methods. Several conclusions drawn from our analysis cannot be inferred from the original publications.

980 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel methodology for re-identification, based on Pictorial Structures (PS), which learns the appearance of an individual, improving the localization of its parts, thus obtaining more reliable visual characteristics for re-identification.
Abstract: We propose a novel methodology for re-identification, based on Pictorial Structures (PS). Whenever face or other biometric information is missing, humans recognize an individual by selectively focusing on the body parts, looking for part-to-part correspondences. We want to take inspiration from this strategy in a re-identification context, using PS to achieve this objective. For single image re-identification, we adopt PS to localize the parts, extract and match their descriptors. When multiple images of a single individual are available, we propose a new algorithm to customize the fit of PS on that specific person, leading to what we call a Custom Pictorial Structure (CPS). CPS learns the appearance of an individual, improving the localization of its parts, thus obtaining more reliable visual characteristics for re-identification. It is based on the statistical learning of pixel attributes collected through spatio-temporal reasoning. The use of PS and CPS leads to state-of-the-art results on all the available public benchmarks, and opens a new direction for research on re-identification.

692 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: The method reconstructs highly slanted surfaces, achieves impressive disparity details with sub-pixel precision, allows for explicit treatment of occlusions, and can handle large untextured regions.
Abstract: Common local stereo methods match support windows at integer-valued disparities. The implicit assumption that pixels within the support region have constant disparity does not hold for slanted surfaces and leads to a bias towards reconstructing frontoparallel surfaces. This work overcomes this bias by estimating an individual 3D plane at each pixel onto which the support region is projected. The major challenge of this approach is to find a pixel’s optimal 3D plane among all possible planes, whose number is infinite. We show that an ideal algorithm to solve this problem is PatchMatch [1], which we extend to find an approximate nearest neighbor according to a plane. In addition to PatchMatch’s spatial propagation scheme, we propose (1) view propagation, where planes are propagated among left and right views of the stereo pair, and (2) temporal propagation, where planes are propagated from preceding and consecutive frames of a video when doing temporal stereo. Adaptive support weights are used in matching cost aggregation to improve results at disparity borders. We also show that our slanted support windows can be used to compute a cost volume for global stereo methods, which allows for explicit treatment of occlusions and can handle large untextured regions. In the results we demonstrate that our method reconstructs highly slanted surfaces and achieves impressive disparity details with sub-pixel precision. In the Middlebury table, our method is currently the top performer among local methods and takes rank 2 among approximately 110 competitors if sub-pixel precision is considered.

687 citations
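The central object is a per-pixel plane (a, b, c) giving a disparity d(x, y) = a·x + b·y + c that varies across the support window. A much-simplified sketch of the cost and one spatial-propagation sweep (grayscale images, no adaptive weights; the random refinement and view/temporal propagation steps are omitted, and all names are assumptions):

```python
import numpy as np

def plane_cost(left, right, px, py, plane, win=5):
    """Aggregate absolute differences over a slanted support window whose
    disparity at (x, y) is a*x + b*y + c."""
    a, b, c = plane
    h, w = left.shape
    cost, r = 0.0, win // 2
    for y in range(py - r, py + r + 1):
        for x in range(px - r, px + r + 1):
            if not (0 <= y < h and 0 <= x < w):
                continue
            xr = int(round(x - (a * x + b * y + c)))
            cost += (abs(float(left[y, x]) - float(right[y, xr]))
                     if 0 <= xr < w else 1.0)
    return cost

def spatial_sweep(left, right, planes, costs):
    """One left-to-right, top-to-bottom PatchMatch pass: adopt a neighbour's
    plane whenever it explains the pixel better. `planes` is (H, W, 3),
    randomly initialised; `costs` holds the current per-pixel costs."""
    h, w = left.shape
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y, x - 1), (y - 1, x)):
                if ny < 0 or nx < 0:
                    continue
                c = plane_cost(left, right, x, y, planes[ny, nx])
                if c < costs[y, x]:
                    planes[y, x], costs[y, x] = planes[ny, nx], c
```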


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel automatic salient object segmentation algorithm which integrates both bottom-up salient stimuli and object-level shape prior, leading to binary segmentation of the salient object.
Abstract: We propose a novel automatic salient object segmentation algorithm which integrates both bottom-up salient stimuli and object-level shape prior, i.e., a salient object has a well-defined closed boundary. Our approach is formalized as an iterative energy minimization framework, leading to binary segmentation of the salient object. Such energy minimization is initialized with a saliency map which is computed through context analysis based on multi-scale superpixels. The object-level shape prior is then extracted by combining saliency with object boundary information. Both the saliency map and the shape prior are updated after each iteration. Experimental results on two public benchmark datasets show that our proposed approach outperforms state-of-the-art methods.

486 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: The hand detector exceeds the state of the art on two public datasets, including the PASCAL VOC 2010 human layout challenge, and a fully annotated hand dataset is introduced for training and testing.
Abstract: We describe a two-stage method for detecting hands and their orientation in unconstrained images. The first stage uses three complementary detectors to propose hand bounding boxes. Each bounding box is then scored by the three detectors independently, and a second-stage classifier is learnt to compute a final confidence score for the proposals using these features. We make the following contributions: (i) we add context-based and skin-based proposals to a sliding window shape based detector to increase recall; (ii) we develop a new method of non-maximum suppression based on super-pixels; and (iii) we introduce a fully annotated hand dataset for training and testing. We show that the hand detector exceeds the state of the art on two public datasets, including the PASCAL VOC 2010 human layout challenge.

280 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel approach for detecting social interactions in a crowded scene employing solely visual cues, based on the sociological notion of F-formation: a set of possible configurations in space that people may assume while participating in a social interaction.
Abstract: We present a novel approach for detecting social interactions in a crowded scene by employing solely visual cues. The detection of social interactions in unconstrained scenarios is a valuable and important task, especially for surveillance purposes. Our proposal is inspired by the social signaling literature, and in particular it considers the sociological notion of F-formation. An F-formation is a set of possible configurations in space that people may assume while participating in a social interaction. Our system takes as input the positions of the people in a scene and their (head) orientations; then, employing a voting strategy based on the Hough transform, it recognizes F-formations and the individuals associated with them. Experiments on simulated and real data support our approach.

222 citations
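A minimal sketch of the voting idea: each person casts a vote at a point a fixed stride along their gaze direction (a candidate o-space centre), and accumulator cells collecting votes from at least two people indicate an F-formation (stride, grid resolution and threshold are illustrative values, not the paper's):

```python
import numpy as np

def detect_f_formations(positions, orientations, stride=1.0,
                        grid_res=0.25, scene=10.0, min_votes=2):
    """Hough-style voting for F-formation centres on a discretised floor
    plan; positions in metres, orientations in radians."""
    n = int(scene / grid_res)
    acc = np.zeros((n, n))
    for (px, py), theta in zip(positions, orientations):
        cx = px + stride * np.cos(theta)     # candidate o-space centre
        cy = py + stride * np.sin(theta)
        ix, iy = int(cx / grid_res), int(cy / grid_res)
        if 0 <= ix < n and 0 <= iy < n:
            acc[iy, ix] += 1
    return np.argwhere(acc >= min_votes) * grid_res

# Two people 2 m apart facing each other vote for the same centre.
centres = detect_f_formations([(0.0, 0.0), (2.0, 0.0)], [0.0, np.pi])
```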


Proceedings ArticleDOI
01 Jan 2011
TL;DR: A novel framework for combining the merits of inertial and visual data from a monocular camera to accumulate estimates of local motion incrementally and reliably reconstruct the trajectory traversed is presented.
Abstract: The increasing demand for real-time high-precision Visual Odometry systems as part of navigation and localization tasks has recently been driving research towards more versatile and scalable solutions. In this paper, we present a novel framework for combining the merits of inertial and visual data from a monocular camera to accumulate estimates of local motion incrementally and reliably reconstruct the trajectory traversed. We demonstrate the robustness and efficiency of our methodology in a scenario with challenging camera dynamics, and present a comprehensive evaluation against ground-truth data.

199 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: Comparing pose-based, appearance-based and combined pose and appearance features for action recognition in a home-monitoring scenario shows that pose-based features outperform low-level appearance features, even when heavily corrupted by noise, suggesting that pose estimation is beneficial for the action recognition task.
Abstract: Early works on human action recognition focused on tracking and classifying articulated body motions. Such methods required accurate localisation of body parts, which is a difficult task, particularly under realistic imaging conditions. As such, recent trends have shifted towards the use of more abstract, low-level appearance features such as spatio-temporal interest points. Motivated by the recent progress in pose estimation, we feel that pose-based action recognition systems warrant a second look. In this paper, we address the question of whether pose estimation is useful for action recognition or if it is better to train a classifier only on low-level appearance features drawn from video data. We compare pose-based, appearance-based and combined pose and appearance features for action recognition in a home-monitoring scenario. Our experiments show that pose-based features outperform low-level appearance features, even when heavily corrupted by noise, suggesting that pose estimation is beneficial for the action recognition task.

191 citations


Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work empirically studies material recognition of real-world objects using a rich set of local features using the Kernel Descriptor framework and extends the set of descriptors to include material-motivated attributes using variances of gradient orientation and magnitude.
Abstract: Material recognition is a fundamental problem in perception that is receiving increasing attention. Following the recent work using Flickr [16, 23], we empirically study material recognition of real-world objects using a rich set of local features. We use the Kernel Descriptor framework [5] and extend the set of descriptors to include material-motivated attributes using variances of gradient orientation and magnitude. Large-Margin Nearest Neighbor learning is used for a 30-fold dimension reduction. We improve the state-of-the-art accuracy on the Flickr dataset [16] from 45% to 54%. We also introduce two new datasets using ImageNet and macro photos, extensively evaluating our set of features and showing promising connections between material and object recognition.

113 citations


Proceedings ArticleDOI
01 Sep 2011
TL;DR: It is shown that video object segmentation can be naturally cast as a semi-supervised learning problem and be efficiently solved using harmonic functions, and an incremental self-training approach that iteratively labels the least uncertain frame and updates similarity metrics is proposed.
Abstract: This work addresses the problem of segmenting an object of interest out of a video. We show that video object segmentation can be naturally cast as a semi-supervised learning problem and be efficiently solved using harmonic functions. We propose an incremental self-training approach that iteratively labels the least uncertain frame and updates the similarity metrics. Our self-training video segmentation produces superior results both qualitatively and quantitatively. Moreover, the use of harmonic functions naturally supports interactive segmentation. We suggest active learning methods for providing guidance to the user on what to annotate in order to improve labeling efficiency. We present experimental results using a ground truth data set and a quantitative comparison to a representative object segmentation system.
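The harmonic-function step itself is a small linear solve. A minimal sketch assuming a precomputed affinity matrix W over the units being labelled (the paper's substance, i.e. the features, similarity metrics and self-training loop, is omitted here):

```python
import numpy as np

def harmonic_labels(W, labelled, y):
    """Harmonic solution on a graph with affinity matrix W: the `labelled`
    indices carry values y in {0, 1}; the rest are inferred."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    u = np.setdiff1d(np.arange(n), labelled)     # unlabelled indices
    # Solve L_uu f_u = -L_ul y  (minimises the quadratic graph energy)
    f_u = np.linalg.solve(L[np.ix_(u, u)], -L[np.ix_(u, labelled)] @ y)
    f = np.zeros(n)
    f[labelled], f[u] = y, f_u
    return f   # soft scores; threshold at 0.5 for a binary segmentation
```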

Proceedings ArticleDOI
David Pfeiffer, Uwe Franke
01 Jan 2011
TL;DR: This work presents a novel reconstruction of stereo vision data that makes it possible to incorporate real-world constraints such as perspective ordering and delivers an optimal segmentation with respect to freespace and obstacle information.
Abstract: Dense 3D data as delivered by stereo vision systems, modern laser scanners or time-of-flight cameras such as PMD is a key element for 3D scene understanding. Real-time high-level vision systems require a compact and explicit representation of that data which allows for efficient attention control, object detection, and reasoning. Because man-made environments are dominated by planar horizontal and vertical surfaces, we approximate the three-dimensional scenery by using sets of thin planar rectangles called Stixels. This medium-level representation serves as input for further processing steps and applications, which, using this novel representation, are not required to process the large amounts of raw 3D data individually. This reconstruction is addressed by means of a unified probabilistic approach. Dynamic programming makes it possible to incorporate real-world constraints such as perspective ordering and delivers an optimal segmentation with respect to freespace and obstacle information. We present results for both stereo vision data and laser data. The real-time capable approach can also be used to fuse the information of multiple data sources.

Proceedings ArticleDOI
29 Aug 2011
TL;DR: The aim of the present work is to demonstrate that for the task of image denoising, nearly state-of-the-art results can be achieved using orthogonal dictionaries only, provided that they are learned directly from the noisy image.
Abstract: In recent years, overcomplete dictionaries combined with sparse learning techniques became extremely popular in computer vision. While their usefulness is undeniable, the improvement they provide in specific tasks of computer vision is still poorly understood. The aim of the present work is to demonstrate that for the task of image denoising, nearly state-of-the-art results can be achieved using orthogonal dictionaries only, provided that they are learned directly from the noisy image. To this end, we introduce three patch-based denoising algorithms which perform hard thresholding on the coefficients of the patches in image-specific orthogonal dictionaries. The algorithms differ by the methodology of learning the dictionary: local PCA, hierarchical PCA and global PCA. We carry out a comprehensive empirical evaluation of the performance of these algorithms in terms of accuracy and running times. The results reveal that, despite its simplicity, PCA-based denoising appears to be competitive with the state-of-the-art denoising algorithms, especially for large images and moderate signal-to-noise ratios.
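A sketch of the "global PCA" variant: learn a single orthogonal dictionary by PCA over all patches of the noisy image itself, hard-threshold the patch coefficients, and average the overlapping reconstructions (the threshold rule and all parameter values are assumptions):

```python
import numpy as np

def global_pca_denoise(noisy, patch=8, tau=3.0, sigma=20.0):
    """Patch-based denoising by hard thresholding in an image-specific
    orthogonal PCA dictionary."""
    h, w = noisy.shape
    idx = [(y, x) for y in range(h - patch + 1) for x in range(w - patch + 1)]
    P = np.stack([noisy[y:y + patch, x:x + patch].ravel() for y, x in idx])
    mean = P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)  # PCA basis
    coeffs = (P - mean) @ Vt.T
    coeffs[np.abs(coeffs) < tau * sigma] = 0.0               # hard threshold
    P_hat = coeffs @ Vt + mean
    out, cnt = np.zeros((h, w)), np.zeros((h, w))
    for (y, x), p in zip(idx, P_hat):                        # overlap average
        out[y:y + patch, x:x + patch] += p.reshape(patch, patch)
        cnt[y:y + patch, x:x + patch] += 1
    return out / cnt
```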

Proceedings ArticleDOI
01 Jan 2011
TL;DR: A fusion formulation which integrates low- and high-dimensional tracking approaches into one framework and ensures that the overall performance of the system is improved by concentrating on the respective advantages of the two approaches and resolving their weak points.
Abstract: Tracking generic human motion is highly challenging due to its high-dimensional state space and the various motion types involved. In order to deal with these challenges, a fusion formulation which integrates low- and high-dimensional tracking approaches into one framework is proposed. The low-dimensional approach successfully overcomes the high-dimensional problem of tracking the motions with available training data by learning motion models, but it only works with specific motion types. On the other hand, although the high-dimensional approach may recover the motions without learned models by sampling directly in the pose space, it lacks robustness and efficiency. Within the framework, the two parallel approaches, low- and high-dimensional, are fused via a probabilistic approach at each time step. This probabilistic fusion approach ensures that the overall performance of the system is improved by concentrating on the respective advantages of the two approaches and resolving their weak points. The experimental results, after qualitative and quantitative comparisons, demonstrate the effectiveness of the proposed approach in tracking generic human motion.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The descriptor MiC, which encodes image microscopic configuration by a linear configuration model, is shown to avoid the generalization problem suffered by other statistical learning methods.
Abstract: Texture classification can be cast as the problem of classifying images according to textural cues, that is, categorizing a texture image obtained under certain illumination and viewpoint conditions as belonging to one of the pre-learned texture classes. It therefore mainly passes through two steps: image representation or description, and classification. In this paper, we focus on the feature extraction part, which aims to extract effective patterns to distinguish different textures. Among various feature extraction methods, local features have performed well in real-world applications, such as LBP [4], SIFT [2] and Histogram of Oriented Gradients (HOG) [1]. Representative methods also include grey level difference or co-occurrence statistics [10], and methods based on multi-channel filtering or wavelet decomposition [3, 5, 7]. To learn representative structural configurations from texture images, Varma et al. proposed texton methods based on the filter response space and local image patch space [8, 9]. In this paper we present the descriptor MiC, which encodes image microscopic configuration by a linear configuration model. The final local configuration pattern (LCP) feature integrates both the microscopic features, represented by optimal model parameters, and local features, represented by pattern occurrences. To be specific, microscopic features capture the image microscopic configuration, which embodies image configuration and pixel-wise interaction relationships, by a linear model. The optimal model parameters are estimated by an efficient least squares estimator. To achieve rotation invariance, which is a desired property for texture features, the Fourier transform is applied to the estimated parameter vectors. Finally, the transformed vectors are concatenated with local pattern occurrences to construct LCPs. As this framework is unsupervised, it avoids the generalization problem suffered by other statistical learning methods. To model the image configuration with respect to each pattern, we estimate optimal weights, associated with the intensities of neighboring pixels, to linearly reconstruct the central pixel intensity. This can be expressed by:
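The abstract breaks off at the equation. A plausible reconstruction of the least-squares model it describes (the notation below is assumed for illustration, not copied from the paper): weights a_i linearly reconstruct the central intensity from its P neighbours, over all occurrences j of a given local pattern,

```latex
\hat{\mathbf{a}} \;=\; \arg\min_{\mathbf{a}} \;\sum_{j}
  \Big( g_c^{(j)} \;-\; \sum_{i=0}^{P-1} a_i \, g_i^{(j)} \Big)^{2},
```

where g_c^{(j)} is the central pixel intensity of the j-th occurrence and g_i^{(j)} are its P neighbours; the minimizer is given in closed form by the ordinary least-squares estimator mentioned in the text.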

Proceedings ArticleDOI
01 Sep 2011
TL;DR: This paper compares two non-linear, discriminative regression strategies for predicting shape updates, a boosting approach and variants of Random Forest regression, and presents results that show that the generalisation performance of the Random Forest is superior to that of the linear or boosted regression procedure.
Abstract: Active Appearance Models (AAMs) are widely used to fit shape models to new images. Recently it has been demonstrated that non-linear regression methods and sequences of AAMs can significantly improve performance over the original linear formulation. In this paper we focus on the ability of a model trained on one dataset to generalise to other sets with different conditions. In particular we compare two non-linear, discriminative regression strategies for predicting shape updates, a boosting approach and variants of Random Forest regression. We investigate the use of these regression methods within a sequential model fitting framework, where each stage in the sequence consists of a shape model and a corresponding regression model. The performance of the framework is assessed by both testing on unseen data taken from within the training databases, as well as by investigating the more difficult task of generalising to unrelated datasets. We present results that show that (a) the generalisation performance of the Random Forest is superior to that of the linear or boosted regression procedure and that (b) using a simple feature selection procedure, the Random Forest can be made to be as efficient as the boosting procedure without significant reduction in accuracy.
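One stage of the sequential framework reduces to a multi-output regression from appearance features sampled at the current shape estimate to a shape-parameter update. A minimal sketch with scikit-learn's Random Forest (the data shapes, feature sampler and all values are placeholders, not the paper's setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data: features sampled at randomly perturbed shapes, paired with
# the known parameter displacement back to the ground-truth shape.
X = np.random.randn(500, 128)    # appearance features (placeholder)
dY = np.random.randn(500, 10)    # shape-parameter displacements (placeholder)

stage = RandomForestRegressor(n_estimators=100).fit(X, dY)

def fit_step(params, sample_features):
    """One iteration of model fitting: sample features at the current
    shape, predict an update, apply it."""
    return params + stage.predict(sample_features(params)[None, :])[0]
```

A sequence of such (shape model, regressor) stages, each trained on the residuals of the previous one, gives the sequential fitting framework described above.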

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The idea is to decompose the problem into subproblems, including initial estimation under fixed head pose and subsequent compensations for estimation biases caused by head rotation and eye appearance distortion, and to solve each subproblem by either a learning-based method or a geometry-based calculation.
Abstract: To infer human gaze from eye appearance, various methods have been proposed. However, most of them assume a fixed head pose, because allowing free head motion adds 6 degrees of freedom to the problem and requires a prohibitively large number of training samples. In this paper, we aim at solving the appearance-based gaze estimation problem under free head motion without significantly increasing the cost of training. The idea is to decompose the problem into subproblems, including initial estimation under fixed head pose and subsequent compensations for estimation biases caused by head rotation and eye appearance distortion. Each subproblem is then solved by either a learning-based method or a geometry-based calculation. Specifically, the gaze estimation bias caused by eye appearance distortion is learnt effectively from a 5-second video clip. Extensive experiments were conducted to verify the effectiveness of the proposed approach.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: Side-Information based Linear Discriminant Analysis (SILD), in which the within-class and between-class scatter matrices are directly calculated by using the side-information, is proposed, and it is theoretically proven that the SILD method is equivalent to FLDA when the full class label information is available.
Abstract: In recent years, face recognition in the unconstrained environment has attracted increasing attention, and a few methods have been evaluated on the Labeled Faces in the Wild (LFW) database. In unconstrained conditions, sometimes we cannot obtain the full class label information of all the subjects. Instead we can only get weak label information, such as side-information, i.e., image pairs from the same or different subjects. In this scenario, many multi-class methods (e.g., the well-known Fisher Linear Discriminant Analysis (FLDA)) fail to work due to the lack of full class label information. To effectively utilize the side-information in such a case, we propose Side-Information based Linear Discriminant Analysis (SILD), in which the within-class and between-class scatter matrices are directly calculated by using the side-information. Moreover, we theoretically prove that our SILD method is equivalent to FLDA when the full class label information is available. Experiments on the LFW and FRGC databases support our theoretical analysis, and SILD using multiple features also achieves promising performance when compared with the state-of-the-art methods.
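SILD's core step builds the two scatter matrices directly from the labelled pairs; a minimal numpy sketch under that reading (variable names are assumptions):

```python
import numpy as np

def sild_scatter(same_pairs, diff_pairs):
    """Within- and between-class scatter from side-information only:
    pairs (x1, x2) known to be the same subject, and pairs known to be
    different subjects. Projection directions then follow from the
    usual generalized eigenproblem, as in FLDA."""
    d = same_pairs[0][0].shape[0]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for x1, x2 in same_pairs:
        v = (x1 - x2)[:, None]
        Sw += v @ v.T
    for x1, x2 in diff_pairs:
        v = (x1 - x2)[:, None]
        Sb += v @ v.T
    return Sw, Sb
```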

Proceedings ArticleDOI
01 Jan 2011
TL;DR: A new saliency detection model is proposed by combining global information from frequency domain analysis and local information from spatial domain analysis; the model has the ability to highlight both small and large salient regions in cluttered scenes and to inhibit repeating objects.
Abstract: We propose a new saliency detection model by combining global information from frequency domain analysis and local information from spatial domain analysis. In the frequency domain analysis, instead of modeling salient regions, we model the non-salient regions using global information; these so-called repeating patterns that are not distinctive in the scene are suppressed by using spectrum smoothing. In the spatial domain analysis, we enhance those regions that are more informative by using a center-surround mechanism similar to that found in the visual cortex. Finally, the outputs from these two channels are combined to produce the saliency map. We demonstrate that the proposed model has the ability to highlight both small and large salient regions in cluttered scenes and to inhibit repeating objects. Experimental results also show that the proposed model outperforms existing algorithms in predicting object regions to which humans pay more attention.
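One plausible reading of the frequency-domain channel, as a sketch: smoothing the amplitude spectrum flattens the sharp spectral peaks that correspond to globally repeating patterns, so reconstructing with the original phase de-emphasizes them (the kernel size and normalisation here are assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spectral_channel(img, k=9):
    """Suppress repeating (non-salient) patterns by smoothing the
    amplitude spectrum, then reconstruct with the original phase."""
    F = np.fft.fft2(img)
    amp, phase = np.abs(F), np.angle(F)
    amp_s = uniform_filter(amp, size=k)   # flatten spectral peaks
    sal = np.abs(np.fft.ifft2(amp_s * np.exp(1j * phase))) ** 2
    return sal / sal.max()
```

The paper combines this global channel with a local center-surround channel before producing the final saliency map.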

Proceedings ArticleDOI
01 Jan 2011
TL;DR: A strong improvement in the reliability of the depth estimate is shown, as well as improved performance on an object segmentation task in a table-top scenario.
Abstract: The introduction of the Microsoft Kinect sensor has stirred significant interest in the robotics community. While originally developed as a gaming interface, a high quality depth sensor and affordable price have made it a popular choice for robotic perception. Its active sensing strategy is very well suited to produce robust and high-frame-rate depth maps for human pose estimation. But the shift to the robotics domain has surfaced applications under a wider set of operating conditions than it was originally designed for. We see the sensor fail completely on transparent and specular surfaces, which are very common among everyday household objects. As these items are of great interest in home robotics and assistive technologies, we have investigated methods to reduce and sometimes even eliminate these effects without any modification of the hardware. In particular, we complement the depth estimate within the Kinect by a cross-modal stereo path that we obtain from disparity matching between the included IR and RGB sensors of the Kinect. We investigate how the RGB channels can be combined optimally in order to mimic the image response of the IR sensor, by an early fusion scheme of weighted channels as well as a late fusion scheme that computes stereo matches between the different channels independently. We show a strong improvement in the reliability of the depth estimate as well as improved performance on an object segmentation task in a table-top scenario.
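The early-fusion scheme amounts to choosing per-channel weights so that the weighted RGB image mimics the IR sensor's response; such weights can be fit by least squares on a registered RGB/IR image pair. A simplified sketch (the plain linear model and all names are assumptions):

```python
import numpy as np

def fit_channel_weights(rgb, ir):
    """Least-squares weights w so that rgb @ w approximates the IR image.
    `rgb` is (H, W, 3), `ir` is (H, W), both registered and float."""
    A = rgb.reshape(-1, 3).astype(float)   # one (r, g, b) row per pixel
    b = ir.ravel().astype(float)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Usage: pseudo_ir = rgb @ w can then be stereo-matched against the true
# IR image to fill in depth where the active sensing fails.
```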

Proceedings ArticleDOI
29 Aug 2011
TL;DR: This paper defines a space of grids where each grid is obtained by a series of recursive axis aligned splits of cells and casts the classification problem in a maximum margin formulation with the optimization being over the weight vector and the spatial grid.
Abstract: Spatial Pyramid Representation (SPR) [7] introduces spatial layout information to the orderless bag-of-features (BoF) representation. SPR has become the standard and has been shown to perform competitively against more complex methods for incorporating spatial layout. In SPR the image is divided into regular grids. However, the grids are taken as uniform spatial partitions without any theoretical motivation. In this paper, we address this issue and propose to learn the spatial partitioning with the BoF representation. We define a space of grids where each grid is obtained by a series of recursive axis aligned splits of cells. We cast the classification problem in a maximum margin formulation with the optimization being over the weight vector and the spatial grid. In addition to experiments on two challenging public datasets (Scene-15 and Pascal VOC 2007) showing that the learnt grids consistently perform better than the SPR while being much smaller in vector length, we also introduce a new dataset of human attributes and show that the current method is well suited to the recognition of spatially localized human attributes.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: For the first time, a weakly supervised action detection method is proposed which only requires binary labels of the videos indicating the presence of the action of interest, and that the proposed MIL algorithm significantly outperforms the existing ones.
Abstract: The detection of human action in videos of busy natural scenes with dynamic background is of interest for applications such as video surveillance. Taking a conventional fully supervised approach, the spatio-temporal locations of the action of interest have to be manually annotated frame by frame in the training videos, which is tedious and unreliable. In this paper, for the first time, a weakly supervised action detection method is proposed which only requires binary labels of the videos indicating the presence of the action of interest. Given a training set of binary labelled videos, the weakly supervised learning (WSL) problem is recast as a multiple instance learning (MIL) problem. A novel MIL algorithm is developed which differs from the existing MIL algorithms in that it locates the action of interest spatially and temporally by globally optimising both inter- and intra-class distance. We demonstrate through experiments that our WSL approach can achieve comparable detection performance to a fully supervised learning approach, and that the proposed MIL algorithm significantly outperforms the existing ones.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: Standard cinematography practice is to first establish which characters are looking at each other using a medium or wide shot, and then edit subsequent close-up shots so that the eyelines match the point of view of the characters.
Abstract: If you read any book on film editing or listen to a director’s commentary on a DVD, then what emerges again and again is the importance of eyelines. Standard cinematography practice is to first establish which characters are looking at each other using a medium or wide shot, and then edit subsequent close-up shots so that the eyelines match the point of view of the characters. This is the basis of the well-known 180° rule in editing.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The results show that preserving the sparse representation of the signals from the original space in the (lower) dimensional projected space is beneficial for several benchmarks and corroborate the usefulness of the proposed sparse representation based linear and non-linear projections.
Abstract: In dimensionality reduction most methods aim at preserving one or a few properties of the original space in the resulting embedding. As our results show, preserving the sparse representation of the signals from the original space in the (lower-dimensional) projected space is beneficial for several benchmarks (faces, traffic signs, and handwritten digits). The intuition behind this is that taking a sparse representation of the different samples as the point of departure highlights the important correlations among the samples, which one then wants to exploit to arrive at the final, effective low-dimensional embedding. We explicitly adapt the LPP and LLE techniques to work with the sparse representation criterion and compare to the original methods on the referenced databases, for both the unsupervised and supervised cases. The improved results corroborate the usefulness of the proposed sparse representation based linear and non-linear projections.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This paper analyzes the calibration of the underwater housing parameters using a camera model that explicitly considers refraction, with the goal of determining those parameters without the need to handle a calibration target underwater, which is cumbersome, if not impossible.
Abstract: When using perspective cameras underwater, the underwater housing with its glass interface between water and air causes the light rays to change their direction due to refraction. In applications where geometrical properties of images are exploited without explicitly modeling refraction, i.e. when using the perspective pinhole camera model, this leads to a systematic error. This error depends on the housing configuration, such as the distance between camera and glass interface and the angle between the glass interface normal and the optical axis. In this paper, we analyze the calibration of those parameters using a camera model that explicitly considers refraction. The goal is to determine those parameters without the need of handling a calibration target underwater, which is cumbersome, if not impossible.
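The systematic effect being calibrated is ordinary refraction at the flat port; Snell's law in its standard vector form, as a sketch (the paper's contribution is estimating the interface distance and orientation, which this snippet takes as given):

```python
import numpy as np

def refract(d, n, n1, n2):
    """Refract unit ray direction `d` at an interface with unit normal `n`
    (pointing toward the incoming ray), from refractive index n1 to n2,
    e.g. air (1.0) -> glass (~1.5) -> water (~1.33) for a flat port."""
    cos_i = -float(np.dot(n, d))
    r = n1 / n2
    k = 1.0 - r * r * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None                # total internal reflection
    return r * d + (r * cos_i - np.sqrt(k)) * n
```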

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work proposes a new method to quickly and robustly estimate the 3D pose of the human skeleton from volumetric body scans without the need for visual markers and provides an extensive qualitative and quantitative evaluation of the method.
Abstract: We propose a new method to quickly and robustly estimate the 3D pose of the human skeleton from volumetric body scans without the need for visual markers. The core principle of our algorithm is to apply a fast center-line extraction to 3D voxel data and robustly fit a skeleton model to the resulting graph. Our algorithm allows for automatic, single-frame initialization and tracking of the human pose while being fast enough for real-time applications at up to 30 frames per second. We provide an extensive qualitative and quantitative evaluation of our method on real and synthetic datasets which demonstrates the stability of our algorithm even when applied to long motion sequences.

Proceedings ArticleDOI
30 Aug 2011
TL;DR: A new principled way to compare the dynamics and temporal structure of actions by computing the distance between their auto-correlations is provided and a practical formulation to compute this distance in any feature space deriving from a base kernel between frames is derived.
Abstract: We address the problem of action recognition by describing actions as time series of frames and introduce a new kernel to compare their dynamical aspects. Action recognition in realistic videos has been successfully addressed using kernel methods like SVMs. Most existing approaches average local features over video volumes and compare the resulting vectors using kernels on bags of features. In contrast, we model actions as time series of per-frame representations and propose a kernel specifically tailored for the purpose of action recognition. Our main contributions are the following: (i) we provide a new principled way to compare the dynamics and temporal structure of actions by computing the distance between their auto-correlations, (ii) we derive a practical formulation to compute this distance in any feature space deriving from a base kernel between frames and (iii) we report experimental results on recent action recognition datasets showing that it provides useful complementary information to the average distribution of frames, as used in state-of-the-art models based on bag-of-features.
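A minimal sketch of the autocorrelation comparison with a linear base kernel between frames (the paper derives the general kernelised form; the maximum lag and the squared Frobenius distance below are illustrative choices):

```python
import numpy as np

def autocorrelation(X, max_lag=5):
    """Auto-correlation matrices of a time series of per-frame feature
    vectors X (T x d), one (d x d) matrix per lag."""
    T = len(X)
    return np.stack([(X[:T - s].T @ X[s:]) / (T - s)
                     for s in range(1, max_lag + 1)])

def dynamics_distance(X, Y, max_lag=5):
    """Distance between the dynamics of two actions as the distance
    between their auto-correlations; a kernel can follow, e.g. as
    exp(-gamma * distance)."""
    return float(np.sum((autocorrelation(X, max_lag)
                         - autocorrelation(Y, max_lag)) ** 2))
```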

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This work proposes and investigates the use of different optimisation methods to search for the best patches and their respective transformations for producing consistent, improved completions in photo-editing.
Abstract: Image completion is an important photo-editing task which involves synthetically filling a hole in the image such that the image still appears natural. State-of-the-art image completion methods work by searching for patches in the image that fit well in the hole region. Our key insight is that image patches remain natural under a variety of transformations (such as scale, rotation and brightness change), and it is important to exploit this. We propose and investigate the use of different optimisation methods to search for the best patches and their respective transformations for producing consistent, improved completions. Experiments on a number of challenging problem instances demonstrate that our methods outperform state-of-the-art techniques.
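The search space couples a source patch with a transformation. A brute-force scoring sketch for a single candidate under scale, rotation and brightness change, with NaNs marking the unknown pixels of the hole-boundary target (a simplification for illustration; the paper investigates optimisers far better than exhaustive search):

```python
import numpy as np
from scipy.ndimage import rotate, zoom

def patch_score(target, candidate, scale, angle, gain):
    """SSD between the known pixels of `target` (NaN = unknown) and a
    scaled / rotated / brightness-adjusted version of `candidate`."""
    warped = rotate(zoom(candidate, scale), angle, reshape=False) * gain
    h, w = target.shape
    if warped.shape[0] < h or warped.shape[1] < w:
        return np.inf              # transform leaves too little support
    warped = warped[:h, :w]
    known = ~np.isnan(target)
    return float(np.mean((warped[known] - target[known]) ** 2))
```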

Proceedings ArticleDOI
01 Jan 2011
TL;DR: The proposed models of top-down visual guidance consider task influences: they estimate the state of a human subject performing a task, map that state to an eye position, and use Bayesian Networks to combine all the multi-modal factors in a unified framework.
Abstract: Modeling how visual saliency guides the deployment of attention over visual scenes has attracted much interest recently, among both computer vision and experimental/computational researchers, since visual attention is a key function of both machine and biological vision systems. Research efforts in computer vision have mostly been focused on modeling bottom-up saliency. Strong influences on attention and eye movements, however, come from instantaneous task demands. Here, we propose models of top-down visual guidance considering task influences. The new models estimate the state of a human subject performing a task (here, playing video games), and map that state to an eye position. Factors influencing state come from scene gist, physical actions, events, and bottom-up saliency. Proposed models fall into two categories. In the first category, we use classical discriminative classifiers, including Regression, kNN and SVM. In the second category, we use Bayesian Networks to combine all the multi-modal factors in a unified framework. Our approaches significantly outperform 15 competing bottom-up and top-down attention models in predicting future eye fixations on 18,000 and 75,000 video frames and eye movement samples from a driving and a flight combat video game, respectively. We further test and validate our approaches on 1.4M video frames and 11M fixation samples and in all cases obtain higher prediction scores than reference models.