
Showing papers presented at the "British Machine Vision Conference" in 2014


Proceedings ArticleDOI
14 May 2014
TL;DR: It is shown that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost, and it is identified that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance.
Abstract: The latest generation of Convolutional Neural Networks (CNN) has achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. In particular, we show that the data augmentation techniques commonly applied to CNN-based methods can also be applied to shallow methods, and result in an analogous performance boost. Source code and models to reproduce the experiments in the paper are made publicly available.
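As a concrete illustration of the augmentation idea shared between deep and shallow pipelines, here is a minimal sketch (not the authors' released code); `extract_features` is a hypothetical placeholder for any descriptor, e.g. a CNN layer output or a Fisher Vector encoder.

```python
# Minimal sketch of crop/flip augmentation with descriptor averaging.
# `extract_features` is a placeholder for any image representation
# (CNN activations, Fisher Vector, Bag-of-Visual-Words).
import numpy as np

def augmented_descriptor(image, extract_features, crop=0.9):
    """image: HxWx3 array; averages descriptors over the centre crop,
    four corner crops and the horizontal flips of each."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop), int(w * crop)
    offsets = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]
    descriptors = []
    for top, left in offsets:
        patch = image[top:top + ch, left:left + cw]
        for view in (patch, patch[:, ::-1]):   # original + horizontal flip
            descriptors.append(extract_features(view))
    return np.mean(descriptors, axis=0)        # average pooling over the views
```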

3,533 citations


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper presents a novel approach to robust scale estimation that can handle large scale variations in complex image sequences and shows promising results in terms of accuracy and efficiency.
Abstract: Robust scale estimation is a challenging problem in visual object tracking. Most existing methods fail to handle large scale variations in complex image sequences. This paper presents a novel approach ...

2,038 citations


Proceedings ArticleDOI
15 May 2014
TL;DR: Two simple schemes for drastically speeding up convolutional neural networks are presented, achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain.
Abstract: The focus of this paper is speeding up the application of convolutional neural networks. While delivering impressive results across a range of computer vision and machine learning tasks, these networks are computationally demanding, limiting their deployability. Convolutional layers generally consume the bulk of the processing time, and so in this work we present two simple schemes for drastically speeding up these layers. This is achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain. Our methods are architecture agnostic, and can be easily applied to existing CPU and GPU convolutional frameworks for tuneable speedup performance. We demonstrate this with a real world network designed for scene text character recognition [15], showing a possible 2.5× speedup with no loss in accuracy, and 4.5× speedup with less than 1% drop in accuracy, still achieving state-of-the-art on standard benchmarks.
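The core trick can be illustrated in a few lines (a toy sketch, not the paper's implementation): a filter that is rank-1 in the spatial domain factors into a vertical followed by a horizontal 1D convolution, cutting the per-pixel cost from K^2 to 2K multiply-adds.

```python
# Toy sketch of rank-1 spatial filter approximation via SVD.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))
f = rng.standard_normal((5, 5))

# Best rank-1 approximation of the filter via SVD.
U, S, Vt = np.linalg.svd(f)
col = np.sqrt(S[0]) * U[:, :1]          # 5x1 vertical filter
row = np.sqrt(S[0]) * Vt[:1, :]         # 1x5 horizontal filter
f_rank1 = col @ row

# Convolving with the rank-1 filter == two 1D convolutions (associativity).
direct = convolve2d(image, f_rank1, mode='full')
separable = convolve2d(convolve2d(image, col, mode='full'), row, mode='full')
assert np.allclose(direct, separable)

# How well the rank-1 filter approximates the original one:
print('relative approximation error:',
      np.linalg.norm(f - f_rank1) / np.linalg.norm(f))
```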

1,159 citations


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work describes an approach to predicting the style of images and performs a thorough evaluation of different image features for these tasks, finding that features learned in a multi-layer network generally perform best -- even when trained with object class (not style) labels.
Abstract: The style of an image plays a significant role in how it is viewed, but style has received little attention in computer vision research. We describe an approach to predicting the style of images, and perform a thorough evaluation of different image features for these tasks. We find that features learned in a multi-layer network generally perform best – even when trained with object class (not style) labels. Our large-scale learning methods result in the best published performance on an existing dataset of aesthetic ratings and photographic style annotations. We present two novel datasets: 80K Flickr photographs annotated with 20 curated style labels, and 85K paintings annotated with 25 style/genre labels. Our approach shows excellent classification performance on both datasets. We use the learned classifiers to extend traditional tag-based image search to consider stylistic constraints, and demonstrate cross-dataset understanding of style.

322 citations


Proceedings Article
01 Jan 2014
TL;DR: This work proposes a computationally efficient algorithm which is able to produce accurate results on a large variety of unconstrained videos and outperforms current state-of-the-art methods.
Abstract: We address the problem of Foreground/Background segmentation of “unconstrained” video. By “unconstrained” we mean that the moving objects and the background scene may be highly non-rigid (e.g., waves in the sea); the camera may undergo a complex motion with 3D parallax; moving objects may suffer from motion blur, large scale and illumination changes, etc. Most existing segmentation methods fail on such unconstrained videos, especially in the presence of highly non-rigid motion and low resolution. We propose a computationally efficient algorithm which is able to produce accurate results on a large variety of unconstrained videos. This is obtained by casting the video segmentation problem as a voting scheme on the graph of similar (‘re-occurring’) regions in the video sequence. We start from crude saliency votes at each pixel, and iteratively correct those votes by ‘consensus voting’ of re-occurring regions across the video sequence. The power of our consensus voting comes from the non-locality of the region re-occurrence, both in space and in time – enabling propagation of diverse and rich information across the entire video sequence. Qualitative and quantitative experiments indicate that our approach outperforms current state-of-the-art methods.

296 citations


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work shows for the first time that an event stream, with no additional sensing, can be used to track accurate camera rotation while building a persistent and high quality mosaic of a scene which is super-resolution accurate and has high dynamic range.
Abstract: An event camera is a silicon retina which outputs not a sequence of video frames like a standard camera, but a stream of asynchronous spikes, each with pixel location, sign and precise timing, indicating when individual pixels record a threshold log intensity change. By encoding only image change, it offers the potential to transmit the information in a standard video but at vastly reduced bitrate, and with huge added advantages of very high dynamic range and temporal resolution. However, event data calls for new algorithms, and in particular we believe that algorithms which incrementally estimate global scene models are best placed to take full advantage of its properties. Here, we show for the first time that an event stream, with no additional sensing, can be used to track accurate camera rotation while building a persistent and high quality mosaic of a scene which is super-resolution accurate and has high dynamic range. Our method involves parallel camera rotation tracking and template reconstruction from estimated gradients, both operating on an event-by-event basis and based on probabilistic filtering.

234 citations


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work investigates the use of such freely available 3D models for multicategory 2D object detection and proposes a simple and fast adaptation approach based on decorrelated features, which performs comparably to existing methods trained on large-scale real image domains.
Abstract: The most successful 2D object detection methods require a large number of images annotated with object bounding boxes to be collected for training. We present an alternative approach that trains on virtual data rendered from 3D models, avoiding the need for manual labeling. Growing demand for virtual reality applications is quickly bringing about an abundance of available 3D models for a large variety of object categories. While mainstream use of 3D models in vision has focused on predicting the 3D pose of objects, we investigate the use of such freely available 3D models for multicategory 2D object detection. To address the issue of dataset bias that arises from training on virtual data and testing on real images, we propose a simple and fast adaptation approach based on decorrelated features. We also compare two kinds of virtual data, one rendered with real-image textures and one without. Evaluation on a benchmark domain adaptation dataset demonstrates that our method performs comparably to existing methods trained on large-scale real image domains.
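For illustration only, the sketch below shows the generic "decorrelated features" (whitened, LDA-style) template idea that such adaptation approaches build on; whether this matches the paper's exact adaptation step is an assumption, and all statistics here are placeholders.

```python
# Hedged sketch of a decorrelated (LDA/whitened) linear template: positives
# come from features of rendered virtual examples, while the mean and
# covariance describing background statistics are estimated from real images,
# decorrelating the template against real-image feature correlations.
import numpy as np

def decorrelated_template(pos_feats, bg_mean, bg_cov, reg=1e-2):
    """pos_feats: (n, d) features of virtual positives; bg_mean: (d,),
    bg_cov: (d, d) background statistics from real images."""
    S = bg_cov + reg * np.eye(bg_cov.shape[0])        # regularised covariance
    return np.linalg.solve(S, pos_feats.mean(axis=0) - bg_mean)

def detection_score(w, feat):
    return float(feat @ w)
```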

194 citations


Proceedings ArticleDOI
26 Jun 2014
TL;DR: An in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance is provided.
Abstract: Current top performing Pascal VOC object detectors employ detection proposals to guide the search for objects, thereby avoiding exhaustive sliding window search across images. Despite the popularity of detection proposals, it is unclear which trade-offs are made when using them during object detection. We provide an in-depth analysis of ten object proposal methods along with four baselines regarding ground truth annotation recall (on Pascal VOC 2007 and ImageNet 2013), repeatability, and impact on DPM detector performance. Our findings show common weaknesses of existing methods, and provide insights to choose the most adequate method for different settings.
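The recall criterion used in this kind of study is easy to state precisely; a small sketch follows (not the authors' evaluation code), with boxes given as [x1, y1, x2, y2].

```python
# Hedged sketch of ground-truth recall at a fixed IoU threshold.
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recall(gt_boxes, proposals, thresh=0.5):
    """Fraction of ground-truth boxes matched by some proposal at IoU >= thresh."""
    hits = sum(any(iou(g, p) >= thresh for p in proposals) for g in gt_boxes)
    return hits / max(len(gt_boxes), 1)

# Example: one of two ground-truth objects is covered at IoU >= 0.5 -> recall 0.5.
print(recall([[0, 0, 10, 10], [50, 50, 80, 80]],
             [[1, 1, 11, 11], [0, 0, 5, 5]]))
```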

173 citations


Proceedings ArticleDOI
02 Sep 2014
TL;DR: A model-free tracker that outperforms the existing state-of-the-art algorithms and rarely loses track of the target object is proposed, and a class-specific version of the method, tailored for tracking a particular object class such as human faces, is introduced.
Abstract: Defining hand-crafted feature representations needs expert knowledge, requires time-consuming manual adjustments, and is arguably one of the limiting factors of object tracking. In this paper, we propose a novel solution to automatically relearn the most useful feature representations during the tracking process in order to adapt to appearance changes, pose and scale variations while preventing drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Individually, each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues. These kernels are updated in an online manner at each frame after being trained with just one instance at the initialization of the corresponding CNN. Given a frame, the most promising CNNs in the pool are selected to evaluate the hypotheses for the target object. The hypothesis with the highest score is assigned as the current detection window and the selected models are retrained using warm-start back-propagation which optimizes a structural loss function. In addition to the model-free tracker, we introduce a class-specific version of the proposed method that is tailored for tracking a particular object class such as human faces. Our experiments on a large selection of videos from the recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses track of the target object.

166 citations



Proceedings Article
01 Jan 2014
TL;DR: In this paper, a fully unsupervised approach for the discovery of task relevant objects and how these objects have been used is presented, where a Task Relevant Object (TRO) is an object, or part of an object with which a person interacts during task performance.
Abstract: We present a fully unsupervised approach for the discovery of i) task relevant objects and ii) how these objects have been used. A Task Relevant Object (TRO) is an object, or part of an object, with which a person interacts during task performance. Given egocentric video from multiple operators, the approach can discover objects with which the users interact, both static objects such as a coffee machine and movable ones such as a cup. Importantly, we also introduce the term Mode of Interaction (MOI) to refer to the different ways in which TROs are used. Say, a cup can be lifted, washed, or poured into. When harvesting interactions with the same object from multiple operators, common MOIs can be found. Setup and Dataset: Using a wearable camera and gaze tracker (Mobile Eye-XG from ASL), egocentric video is collected of users performing tasks, along with their gaze in pixel coordinates. Six locations were chosen: kitchen, workspace, laser printer, corridor with a locked door, cardiac gym and weight-lifting machine. The Bristol Egocentric Object Interactions Dataset is publicly available.


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work proposes a method for fully automatic calibration of traffic surveillance cameras for targeted applications, which allows for calibration of the camera – including scale – without any user input, only from several minutes of input surveillance video.
Abstract: We propose a method for fully automatic calibration of traffic surveillance cameras. This method allows for calibration of the camera – including scale – without any user input, only from several minutes of input surveillance video. The targeted applications include speed measurement, measurement of vehicle dimensions, vehicle classification, etc. The first step of our approach is camera calibration by determining three vanishing points defining the stream of vehicles. The second step is construction of 3D bounding boxes of individual vehicles and their measurement up to scale. We propose to first construct the projection of the bounding boxes and then, by using the camera calibration obtained earlier, create their 3D representation. In the third step, we use the dimensions of the 3D bounding boxes for calibration of the scene scale. We collected a dataset with ground truth speed and distance measurements and evaluate our approach on it. The achieved mean accuracy of speed and distance measurement is below 2%. Our efficient C++ implementation runs in real time on a low-end processor (Core i3) with a safe margin even for full-HD videos.
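One ingredient of such a pipeline can be made concrete: with the principal point assumed at the image centre, two orthogonal vanishing points fix the focal length via the constraint (vp1 - pp)·(vp2 - pp) + f² = 0. The sketch below is illustrative only and is not the authors' implementation.

```python
# Illustrative focal-length estimate from two orthogonal vanishing points,
# assuming the principal point lies at the image centre.
import numpy as np

def focal_from_vanishing_points(vp1, vp2, principal_point):
    d = -np.dot(np.asarray(vp1) - principal_point,
                np.asarray(vp2) - principal_point)
    if d <= 0:
        raise ValueError("vanishing points inconsistent with this principal point")
    return float(np.sqrt(d))

pp = np.array([960.0, 540.0])                     # full-HD image centre
print(focal_from_vanishing_points([1500.0, 300.0], [400.0, 800.0], pp))
```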

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work extends regression forests to infer missing depth data of image features and 3D pose simultaneously, and hypothesizes the depth of the features by sweeping with a plane through the 3D volume of potential joint locations.
Abstract: In this work we address the problem of estimating the 3D human pose from a single RGB image, which is a challenging problem since different 3D poses may have similar 2D projections. Following the success of regression forests for 3D pose estimation from depth data or 2D pose estimation from RGB images, we extend regression forests to infer missing depth data of image features and 3D pose simultaneously. Since we do not observe depth for inference or training directly, we hypothesize the depth of the features by sweeping with a plane through the 3D volume of potential joint locations. The regression forests are then combined with a pictorial structure framework, which is extended to 3D. The approach is evaluated on two challenging benchmarks where state-of-the-art performance is achieved.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work draws upon recent work in mid-level discriminative patches to develop a novel method for reranking paintings based on their spatial consistency with natural images of an object category, which combines both class based and instance based retrieval in a single framework.
Abstract: The objective of this work is to recognize object categories (such as animals and vehicles) in paintings, whilst learning these categories from natural images. This is a challenging problem given the substantial differences between paintings and natural images, and variations in depiction of objects in paintings. We first demonstrate that classifiers trained on natural images of an object category have quite some success in retrieving paintings containing that category. We then draw upon recent work in mid-level discriminative patches to develop a novel method for reranking paintings based on their spatial consistency with natural images of an object category. This method combines both class based and instance based retrieval in a single framework. We quantitatively evaluate the method over a number of classes from the PASCAL VOC dataset, and demonstrate significant improvements in rankings of the retrieved paintings over a variety of object categories.

Proceedings ArticleDOI
05 Sep 2014
TL;DR: This work presents the first reliable, validated and multi-scene category ground truth for shadow removal algorithms which overcomes limitations in existing data sets -- such as inconsistencies between shadow and shadow-free images and limited variations of shadows.
Abstract: We present an interactive, robust and high quality method for fast shadow removal. To perform detection we use an on-the-fly learning approach guided by two rough user inputs for the pixels of the shadow and the lit area. From this we derive a fusion image that magnifies shadow boundary intensity change due to illumination variation. After detection, we perform shadow removal by registering the penumbra to a normalised frame which allows us to efficiently estimate non-uniform shadow illumination changes, resulting in accurate and robust removal. We also present the first reliable, validated and multi-scene category ground truth for shadow removal algorithms which overcomes limitations in existing data sets -- such as inconsistencies between shadow and shadow-free images and limited variations of shadows. Using our data, we perform the most thorough comparison of state of the art shadow removal methods to date. Our algorithm outperforms the state of the art, and we supply our P-code and evaluation data and scripts to encourage future open comparisons.

Proceedings ArticleDOI
01 Sep 2014
TL;DR: Experimental results show that the proposed approach is highly efficient and it outperforms state-of-the-art haze removal algorithms in terms of the dehazing effect as well.
Abstract: In this paper, we propose a simple but powerful prior, color attenuation prior, for haze removal from a single input hazy image. By creating a linear model for modelling the scene depth of the hazy image under this novel prior and learning the parameters of the model by using a supervised learning method, the depth information can be well recovered. With the depth map of the hazy image, we can easily remove haze from a single image. Experimental results show that the proposed approach is highly efficient and it outperforms state-of-the-art haze removal algorithms in terms of the dehazing effect as well.
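A minimal sketch of how such a prior can be used (illustrative coefficients, not the trained parameters reported in the paper): depth is modelled as a linear function of brightness and saturation, converted to transmission, and used to invert the haze model I = J·t + A·(1 - t).

```python
# Minimal single-image dehazing sketch with a linear brightness/saturation
# depth prior; the theta coefficients and the atmospheric-light selection
# rule here are simplifications, not the paper's learned values.
import numpy as np

def dehaze(I, theta=(0.12, 0.96, -0.78), beta=1.0, t_min=0.1):
    """I: HxWx3 RGB image in [0, 1]."""
    value = I.max(axis=2)                               # brightness (HSV value)
    sat = (value - I.min(axis=2)) / np.maximum(value, 1e-6)
    depth = theta[0] + theta[1] * value + theta[2] * sat
    t = np.clip(np.exp(-beta * depth), t_min, 1.0)      # transmission map
    # Atmospheric light from the most distant pixels (simplified selection).
    idx = np.unravel_index(np.argsort(depth, axis=None)[-100:], depth.shape)
    A = I[idx].mean(axis=0)
    J = (I - A) / t[..., None] + A                      # recover scene radiance
    return np.clip(J, 0.0, 1.0)
```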




Proceedings ArticleDOI
01 Jan 2014
TL;DR: Regularized Max Pooling outperforms the state of the art by a wide margin on the challenging PASCAL VOC2012 dataset for human action recognition on still images.
Abstract: We propose Regularized Max Pooling (RMP) for image classification. RMP classifies an image (or an image region) by extracting feature vectors at multiple subwindows at multiple locations and scales. Unlike Spatial Pyramid Matching, where the subwindows are defined purely based on geometric correspondence, RMP accounts for the deformation of discriminative parts. The amount of deformation and the discriminative ability of multiple parts are jointly learned during training. RMP outperforms the state of the art by a wide margin on the challenging PASCAL VOC2012 dataset for human action recognition on still images.
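As a rough, assumed-form sketch of the scoring idea (not the authors' formulation), each subwindow is allowed to shift locally and contributes its best feature score minus a learned deformation penalty.

```python
# Hedged inference-time sketch: deformation-penalised max over local offsets,
# summed over subwindows. All callables and parameters are placeholders.
import numpy as np

def rmp_score(extract, weights, anchors, deform_costs, offsets):
    """extract(x, y, scale) -> feature vector for a subwindow;
    weights[i], deform_costs[i] learned per subwindow; anchors[i] = (x, y, s)."""
    total = 0.0
    for w, (ax, ay, s), lam in zip(weights, anchors, deform_costs):
        best = max(w @ extract(ax + dx, ay + dy, s) - lam * (dx * dx + dy * dy)
                   for dx, dy in offsets)
        total += best
    return total
```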

Proceedings ArticleDOI
16 Apr 2014
TL;DR: Dense Neural Patterns, short for DNPs, are introduced, which are dense local features derived from discriminatively trained deep convolutional neural networks that can be easily plugged into conventional detection frameworks in the same way as other dense local features like HOG or LBP.
Abstract: This paper addresses the challenge of establishing a bridge between deep convolutional neural networks and conventional object detection frameworks for accurate and efficient generic object detection. We introduce Dense Neural Patterns, short for DNPs, which are dense local features derived from discriminatively trained deep convolutional neural networks. DNPs can be easily plugged into conventional detection frameworks in the same way as other dense local features (like HOG or LBP). The effectiveness of the proposed approach is demonstrated with the Regionlets object detection framework. It achieved 46.1% mean average precision on the PASCAL VOC 2007 dataset, and 44.1% on the PASCAL VOC 2010 dataset, which dramatically improves the original Regionlets approach without DNPs. It is the first approach efficiently applying deep convolutional features for conventional object detection models.


Proceedings ArticleDOI
01 Jan 2014
TL;DR: This work addresses the challenge of analysing the quality of human movements from visual information which has use in a broad range of applications, from diagnosis and rehabilitation to movement optimisation in sports science.
Abstract: This work addresses the challenge of analysing the quality of human movements from visual information which has use in a broad range of applications, from diagnosis and rehabilitation to movement optimisation in sports science. Traditionally, such assessment is performed as a binary classification between normal and abnormal by comparison against normal and abnormal movement models, e.g. [5]. Since a single model of abnormal movement cannot encompass the variety of abnormalities, another class of methods only compares against one model of normal movement, e.g. [4]. We adopt this latter strategy and propose a continuous assessment of movement quality, rather than a binary classification, by quantifying the deviation from a normal model. In addition, while most methods can only analyse a movement after its completion, e.g. [6], this assessment is performed on a frame-by-frame basis in order to allow fast system response in case of an emergency, such as a fall. Methods such as [4, 6] are specific to one type of movement, mostly due to the features used. In this work, we aim to represent a large variety of movements by exploiting full body information. We use a depth camera and a skeleton tracker [3] to obtain the position of the main joints of the body, as seen in Fig. 1. We normalise this skeleton for global position and orientation of the camera, and for the varying height of the subjects, e.g. using Procrustes analysis. The normalised skeletons have high dimensionality and tend to contain outliers. Thus, the dimensionality is reduced using Diffusion Maps [1] which is modified by including the extension that Gerber et al. [2] presented to deal with outliers in Laplacian Eigenmaps. The resulting high level feature vector Y, obtained from the normalised skeleton at one frame, represents an individual pose and is used to build a statistical model of normal movement. Our statistical model is made up of two components that describe the normal poses and the normal dynamics of the movement. The pose model is in the form of the probability density function (pdf) f_Y(y) of a random variable Y that takes as value y = Y our pose feature vector Y. The pdf is learnt from all the frames of training sequences that contain normal instances of the movement, using a Parzen window estimator. The quality of a new pose y_t at frame t is then assessed as the log-likelihood of being described by the pose model, i.e.
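The frame-by-frame pose term can be sketched directly (not the authors' code): fit a Parzen-window density to pose features from normal training frames and score each new frame by its log-likelihood under that density.

```python
# Minimal sketch of Parzen-window pose-quality scoring on synthetic data.
import numpy as np
from scipy.stats import gaussian_kde

# Y_train: (d, n) low-dimensional pose features from normal movement frames.
rng = np.random.default_rng(0)
Y_train = rng.standard_normal((3, 500))
pose_model = gaussian_kde(Y_train)                 # Parzen-window estimate

def pose_quality(y_t):
    """Log-likelihood of pose feature y_t (shape (d,)) at frame t;
    low values flag abnormal poses."""
    return float(pose_model.logpdf(y_t.reshape(-1, 1))[0])

print(pose_quality(np.zeros(3)), pose_quality(5 * np.ones(3)))
```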

Proceedings ArticleDOI
01 Sep 2014
TL;DR: This work solves the uncalibrated photometric stereo problem with lights placed near the scene and proposes a solution for reconstructing the normal map, the albedo, the light positions and the light intensities of a scene given only a sequence of near-light images.
Abstract: In this work we solve the uncalibrated photometric stereo problem with lights placed near the scene. We investigate different image formation models and find the one that best fits our observations. Although the devised model is more complex than its far-light counterpart, we show that under a global linear ambiguity the reconstruction is possible up to a rotation and scaling, which can be easily fixed. We also propose a solution for reconstructing the normal map, the albedo, the light positions and the light intensities of a scene given only a sequence of near-light images. This is done in an alternating minimization framework which first estimates both the normals and the albedo, and then the light positions and intensities. We validate our method on real world experiments and show that a near-light model leads to a significant improvement in the surface reconstruction compared to the classic distant illumination case.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper observes that the same scene, viewed under two different coloured lights for the same algorithm, leads to different recovery errors despite the fact that when the authors remove the colour bias due to illuminant exactly the same reproduction is produced.
Abstract: Only if we can estimate the colour of the prevailing light - and discount it from the image - can image colour be used as a stable cue for indexing, recognition and tracking (amongst other tasks). Almost all illumination estimation research uses the angle between the RGB of the actual measured illuminant colour and the estimated one as the recovery error. However, here we identify a problem with this metric. We observe that the same scene, viewed under two different coloured lights for the same algorithm, leads to different recovery errors despite the fact that when we remove the colour bias due to the illuminant (we divide out by the light) exactly the same reproduction is produced. We begin this paper by quantifying the scale of this problem. For a given scene and algorithm, we solve for the range of recovery angular errors that can be observed given all colours of light. We also show that the lowest errors are for red, green and blue lights and the largest for cyans, magentas and yellows. Next, we propose a new reproduction angular error which is defined as the angle between the image RGB of a white surface when the actual and estimated illuminations are 'divided out'. Reassuringly, this reproduction error metric, by construction, gives the same error for the same algorithm-scene pair. For many algorithms and many benchmark datasets we recompute the illuminant estimation performance of a range of algorithms for the new reproduction error and then compare against the algorithm rankings for the old recovery error. We find that the overall rankings of algorithms remain, broadly, unchanged - though there can be local switches in rank - and that the algorithm parameters that provide the best illuminant estimation performance depend on the error metric used.
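The two metrics are simple to write down; the sketch below follows the definitions stated above (the exact normalisation used in the paper is an assumption).

```python
# Hedged sketch: conventional recovery angular error (angle between true and
# estimated illuminant RGBs) versus reproduction angular error (angle between
# the white surface reproduced after correcting with the estimated illuminant
# and pure white).
import numpy as np

def angle_deg(a, b):
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def recovery_error(e_true, e_est):
    return angle_deg(e_true, e_est)

def reproduction_error(e_true, e_est):
    # RGB of a white patch after dividing out the *estimated* illuminant.
    reproduced_white = np.asarray(e_true) / np.asarray(e_est)
    return angle_deg(reproduced_white, np.ones(3))

e_true = np.array([0.9, 1.0, 0.6])     # e.g. a warm illuminant
e_est  = np.array([1.0, 1.0, 0.7])
print(recovery_error(e_true, e_est), reproduction_error(e_true, e_est))
```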

Proceedings ArticleDOI
01 Sep 2014
TL;DR: This paper considers viewpoint estimation as a 1-vs-all classification problem on the previously detected object bounding box and shows that the modern representations based on Fisher encoding and convolutional neural network based features together with a neighbor viewpoints suppression strategy on the training data lead to comparable or even better performance than 3D methods.
Abstract: Recent top performing methods for viewpoint estimation make use of 3D information like 3D CAD models or 3D landmarks to build a 3D representation of the class. These 3D annotations are expensive and not really available for many classes. In this paper we investigate whether and how comparable performance can be obtained without any 3D information. We consider viewpoint estimation as a 1-vs-all classification problem on the previously detected object bounding box. In this framework we compare several features and parameter configurations and show that the modern representations based on Fisher encoding and convolutional neural network based features together with a neighbor viewpoints suppression strategy on the training data lead to comparable or even better performance than 3D methods.
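A hedged sketch of the classification setup described above (the exact neighbour-suppression window and classifier settings are assumptions): one linear classifier per viewpoint bin, with samples from adjacent bins dropped from the negative set rather than treated as negatives.

```python
# Illustrative 1-vs-all viewpoint classification with neighbour suppression.
import numpy as np
from sklearn.svm import LinearSVC

def train_viewpoint_classifiers(X, bins, n_bins):
    """X: (n, d) features from detected boxes; bins: (n,) integer viewpoint labels."""
    classifiers = []
    for b in range(n_bins):
        neighbours = {(b - 1) % n_bins, b, (b + 1) % n_bins}
        # Keep the positives of bin b and all samples outside its neighbourhood.
        keep = np.array([lbl == b or lbl not in neighbours for lbl in bins])
        clf = LinearSVC(C=1.0).fit(X[keep], (bins[keep] == b).astype(int))
        classifiers.append(clf)
    return classifiers

def predict_viewpoint(classifiers, x):
    scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
    return int(np.argmax(scores))
```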

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper proposes a general framework to solve Non-Rigid Shape-from-Motion (NRSfM) with the perspective camera under isometric deformations and derives an analytic solution which involves convex, linear least-squares optimization only, and outperforms existing works.
Abstract: This paper proposes a general framework to solve Non-Rigid Shape-from-Motion (NRSfM) with the perspective camera under isometric deformations. Contrary to the usual low-rank linear shape basis, isometry allows us to recover complex shape deformations from a sparse set of images. Existing methods suffer from ambiguities and may be very expensive to solve. We bring four main contributions. First, we formulate isometric NRSfM as a system of first-order Partial Differential Equations (PDE) involving the shape’s depth and normal field and an unknown template. Second, we show this system cannot be locally resolved. Third, we introduce the concept of infinitesimal planarity and show that it makes the system locally solvable for at least three views. Fourth, we derive an analytic solution which involves convex, linear least-squares optimization only, and outperforms existing works.

Proceedings ArticleDOI
01 Jan 2014
TL;DR: This paper studies two complementary cross-modal prediction tasks: predicting text given an image (“Im2Text”), and predicting image(s) given a piece of text (‘Text2Im’), and proposes a novel Structural SVM based unified formulation for these two tasks.
Abstract: Building bilateral semantic associations between images and texts is among the fundamental problems in computer vision. In this paper, we study two complementary cross-modal prediction tasks: (i) predicting text(s) given an image (“Im2Text”), and (ii) predicting image(s) given a piece of text (“Text2Im”). We make no assumption on the specific form of text; i.e., it could be either a set of labels, phrases, or even captions. We pose both these tasks in a retrieval framework. For Im2Text, given a query image, our goal is to retrieve a ranked list of semantically relevant texts from an independent text corpus (i.e., texts with no corresponding images). Similarly, for Text2Im, given a query text, we aim to retrieve a ranked list of semantically relevant images from a collection of unannotated images (i.e., images without any associated textual meta-data). We propose a novel Structural SVM based unified formulation for these two tasks. For both visual and textual data, two types of representations are investigated. These are based on: (1) unimodal probability distributions over topics learned using latent Dirichlet allocation, and (2) explicitly learned multi-modal correlations using canonical correlation analysis. Extensive experiments on three popular datasets (two medium and one web-scale) demonstrate that our framework gives promising results compared to existing models under various settings, thus confirming its efficacy for both the tasks.
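For the CCA-based representation (one of the two investigated), retrieval can be sketched as projecting both modalities into the shared correlated space and ranking by cosine similarity; this is an illustrative sketch, not the authors' Structural SVM formulation.

```python
# Illustrative CCA cross-modal retrieval sketch (assumed setup).
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_cca(img_feats, txt_feats, n_components=10):
    """img_feats: (n, di), txt_feats: (n, dt) paired training features."""
    return CCA(n_components=n_components).fit(img_feats, txt_feats)

def im2text_ranking(cca, query_img, corpus_txt_feats):
    """Im2Text: return indices of corpus texts, most relevant first."""
    q = cca.transform(query_img.reshape(1, -1))            # image -> shared space
    # A dummy X block is passed only so that transform() also projects the texts.
    dummy = np.zeros((len(corpus_txt_feats), query_img.size))
    _, t = cca.transform(dummy, corpus_txt_feats)          # texts -> shared space
    sims = (t @ q.ravel()) / (np.linalg.norm(t, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)
```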

Proceedings ArticleDOI
01 Jan 2014
TL;DR: A sequential solution to dense non-rigid structure from motion that recovers the camera motion and 3D shape of non-rigid objects by processing a monocular image sequence as the data arrives is described.
Abstract: This paper describes a sequential solution to dense non-rigid structure from motion that recovers the camera motion and 3D shape of non-rigid objects by processing a monocular image sequence as the data arrives. We propose to model the time-varying shape with a probabilistic linear subspace of mode shapes obtained from continuum mechanics. To efficiently encode the deformations of dense 3D shapes that contain a large number of mesh vertices, we propose to compute the deformation modes on a down-sampled rest shape using finite element modal analysis at a low computational cost. This sparse shape basis is then grown back to dense by exploiting the shape functions within a finite element. With this probabilistic low-rank constraint, we estimate camera pose and non-rigid shape in each frame using expectation maximization over a sliding window of frames. Since the time-varying weights are marginalized out, our approach only estimates a small number of parameters per frame, and hence can potentially run in real time. We evaluate our algorithm on both synthetic and real sequences with 3D ground truth data for different objects ranging from inextensible to extensible deformations and from sparse to dense shapes. We show the advantages of our approach with respect to competing sequential methods.