Showing papers in "International Journal of Computer Vision in 2011"


Journal ArticleDOI
TL;DR: This paper proposes a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms and analyzes the results obtained to date to draw a large number of conclusions.
Abstract: The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by tracking hidden fluorescent texture, (2) realistic synthetic sequences, (3) high frame-rate video used to study interpolation error, and (4) modified stereo sequences of static scenes. In addition to the average angular error used by Barron et al., we compute the absolute flow endpoint error, measures for frame interpolation error, improved statistics, and results at motion discontinuities and in textureless regions. In October 2007, we published the performance of several well-known methods on a preliminary version of our data to establish the current state of the art. We also made the data freely available on the web at http://vision.middlebury.edu/flow/ . Subsequently a number of researchers have uploaded their results to our website and published papers using the data. A significant improvement in performance has already been achieved. In this paper we analyze the results obtained to date and draw a large number of conclusions from them.

2,534 citations
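
The two headline measures named in the abstract above, angular error and flow endpoint error, can be sketched as follows. This is a minimal illustration that assumes the usual definitions: the angle between the space-time vectors (u, v, 1) and the Euclidean distance between flow vectors. The function names and the list-of-tuples flow representation are hypothetical and are not taken from the benchmark's own code.

```python
import math

def endpoint_error(u, v, u_gt, v_gt):
    """Absolute endpoint error: Euclidean distance between two flow vectors."""
    return math.hypot(u - u_gt, v - v_gt)

def angular_error(u, v, u_gt, v_gt):
    """Angle in degrees between the vectors (u, v, 1) and (u_gt, v_gt, 1)."""
    dot = u * u_gt + v * v_gt + 1.0
    norm = math.sqrt(u * u + v * v + 1.0) * math.sqrt(u_gt * u_gt + v_gt * v_gt + 1.0)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))

def average_errors(flow, flow_gt):
    """Mean angular and endpoint error over paired lists of (u, v) tuples."""
    pairs = list(zip(flow, flow_gt))
    aae = sum(angular_error(u, v, ug, vg) for (u, v), (ug, vg) in pairs) / len(pairs)
    aepe = sum(endpoint_error(u, v, ug, vg) for (u, v), (ug, vg) in pairs) / len(pairs)
    return aae, aepe
```

The appended 1 in the angular error keeps the measure defined for zero flow, which is one reason the two measures can rank methods differently.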


Journal ArticleDOI
TL;DR: This work proposes a principled optimization strategy where nonparametric representations of these likelihoods are maximized within a hierarchy of smoothed estimates and is shown to outperform some common existing methods on the task of generic face fitting.
Abstract: Deformable model fitting has been actively pursued in the computer vision community for over a decade. As a result, numerous approaches have been proposed with varying degrees of success. A class of approaches that has shown substantial promise is one that makes independent predictions regarding locations of the model's landmarks, which are combined by enforcing a prior over their joint motion. A common theme in innovations to this approach is the replacement of the distribution of probable landmark locations, obtained from each local detector, with simpler parametric forms. In this work, a principled optimization strategy is proposed where nonparametric representations of these likelihoods are maximized within a hierarchy of smoothed estimates. The resulting update equations are reminiscent of mean-shift over the landmarks but with regularization imposed through a global prior over their joint motion. Extensions to handle partial occlusions and reduce computational complexity are also presented. Through numerical experiments, this approach is shown to outperform some common existing methods on the task of generic face fitting.

908 citations


Journal ArticleDOI
TL;DR: This work presents a carefully designed dataset of video sequences of planar textures with ground truth, which includes various geometric changes, lighting conditions, and levels of motion blur, and presents a comprehensive quantitative evaluation of detector-descriptor-based visual camera tracking based on this testbed.
Abstract: Applications for real-time visual tracking can be found in many areas, including visual odometry and augmented reality. Interest point detection and feature description form the basis of feature-based tracking, and a variety of algorithms for these tasks have been proposed. In this work, we present (1) a carefully designed dataset of video sequences of planar textures with ground truth, which includes various geometric changes, lighting conditions, and levels of motion blur, and which may serve as a testbed for a variety of tracking-related problems, and (2) a comprehensive quantitative evaluation of detector-descriptor-based visual camera tracking based on this testbed. We evaluate the impact of individual algorithm parameters, compare algorithms for both detection and description in isolation, as well as all detector-descriptor combinations as a tracking solution. In contrast to existing evaluations, which aim at different tasks such as object recognition and have limited validity for visual tracking, our evaluation is geared towards this application in all relevant factors (performance measures, testbed, candidate algorithms). To our knowledge, this is the first work that comprehensively compares these algorithms in this context, and in particular, on video streams.

441 citations


Journal ArticleDOI
TL;DR: A unified model for multi-class object recognition is introduced that casts the problem as a structured prediction task and how to formulate learning as a convex optimization problem is shown.
Abstract: Many state-of-the-art approaches for object recognition reduce the problem to a 0-1 classification task. This allows one to leverage sophisticated machine learning techniques for training classifiers from labeled examples. However, these models are typically trained independently for each class using positive and negative examples cropped from images. At test-time, various post-processing heuristics such as non-maxima suppression (NMS) are required to reconcile multiple detections within and between different classes for each image. Though crucial to good performance on benchmarks, this post-processing is usually defined heuristically. We introduce a unified model for multi-class object recognition that casts the problem as a structured prediction task. Rather than predicting a binary label for each image window independently, our model simultaneously predicts a structured labeling of the entire image (Fig. 1). Our model learns statistics that capture the spatial arrangements of various object classes in real images, both in terms of which arrangements to suppress through NMS and which arrangements to favor through spatial co-occurrence statistics. We formulate parameter estimation in our model as a max-margin learning problem. Given training images with ground-truth object locations, we show how to formulate learning as a convex optimization problem. We employ the cutting plane algorithm of Joachims et al. (Mach. Learn. 2009) to efficiently learn a model from thousands of training images. We show state-of-the-art results on the PASCAL VOC benchmark that indicate the benefits of learning a global model encapsulating the spatial layout of multiple object classes (a preliminary version of this work appeared in ICCV 2009, Desai et al., IEEE international conference on computer vision, 2009).

375 citations


Journal ArticleDOI
TL;DR: A system based on computer vision is presented which detects the presence of rain or snow and the applications are numerous and include the detection of critical weather conditions, the observation of weather, the reliability improvement of video-surveillance systems and rain rendering.
Abstract: The detection of bad weather conditions is crucial for meteorological centers, especially with demand for air, sea and ground traffic management. In this article, a system based on computer vision is presented which detects the presence of rain or snow. To separate the foreground from the background in image sequences, a classical Gaussian Mixture Model is used. The foreground model serves to detect rain and snow, since these are dynamic weather phenomena. Selection rules based on photometry and size are proposed in order to select the potential rain streaks. Then a Histogram of Orientations of rain or snow Streaks (HOS), estimated with the method of geometric moments, is computed, which is assumed to follow a model of Gaussian-uniform mixture. The Gaussian distribution represents the orientation of the rain or the snow whereas the uniform distribution represents the orientation of the noise. An algorithm of expectation maximization is used to separate these two distributions. Following a goodness-of-fit test, the Gaussian distribution is temporally smoothed and its amplitude allows deciding the presence of rain or snow. When the presence of rain or of snow is detected, the HOS makes it possible to detect the pixels of rain or of snow in the foreground images, and to estimate the intensity of the precipitation of rain or of snow. The applications of the method are numerous and include the detection of critical weather conditions, the observation of weather, the reliability improvement of video-surveillance systems and rain rendering.

314 citations
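
The separation of a Gaussian component (streak orientation) from a uniform component (noise) by expectation maximization, as described in the abstract above, can be illustrated with a small sketch. It rests on simplifying assumptions that are not from the paper: orientations are treated as plain numbers in [0, 180) degrees with no wrap-around, every sample has equal weight, and the initial values are arbitrary. The function name `fit_gaussian_uniform` is hypothetical.

```python
import math

def fit_gaussian_uniform(angles, period=180.0, iterations=100):
    """EM fit of p(x) = w * N(x; mu, sigma) + (1 - w) / period.

    Returns (w, mu, sigma), where w is the weight of the Gaussian component.
    """
    mu = sum(angles) / len(angles)   # arbitrary starting point (assumption)
    sigma = period / 4.0
    w = 0.5
    uniform_density = 1.0 / period
    for _ in range(iterations):
        # E-step: responsibility of the Gaussian component for each sample.
        resp = []
        for x in angles:
            gauss = w * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
            resp.append(gauss / (gauss + (1.0 - w) * uniform_density))
        total = sum(resp)
        if total < 1e-12:
            break  # the Gaussian component has vanished
        # M-step: re-estimate the Gaussian parameters and the mixing weight.
        mu = sum(r * x for r, x in zip(resp, angles)) / total
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, angles)) / total
        sigma = max(math.sqrt(var), 1e-3)  # keep sigma away from zero
        w = total / len(angles)
    return w, mu, sigma
```

In this sketch the returned weight `w` plays the role of the Gaussian amplitude; how the paper thresholds or smooths it over time is not reproduced here.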


Journal ArticleDOI
TL;DR: This article presents an approach for modeling landmarks based on large-scale, heavily contaminated image collections gathered from the Internet that efficiently combines 2D appearance and 3D geometric constraints to extract scene summaries and construct 3D models.
Abstract: This article presents an approach for modeling landmarks based on large-scale, heavily contaminated image collections gathered from the Internet. Our system efficiently combines 2D appearance and 3D geometric constraints to extract scene summaries and construct 3D models. In the first stage of processing, images are clustered based on low-dimensional global appearance descriptors, and the clusters are refined using 3D geometric constraints. Each valid cluster is represented by a single iconic view, and the geometric relationships between iconic views are captured by an iconic scene graph. Using structure from motion techniques, the system then registers the iconic images to efficiently produce 3D models of the different aspects of the landmark. To improve coverage of the scene, these 3D models are subsequently extended using additional, non-iconic views. We also demonstrate the use of iconic images for recognition and browsing. Our experimental results demonstrate the ability to process datasets containing up to 46,000 images in less than 20 hours, using a single commodity PC equipped with a graphics card. This is a significant advance towards Internet-scale operation.

311 citations


Journal ArticleDOI
TL;DR: The novel anisotropic smoothness is designed to work complementary to the data term and incorporates directional information from the data constraints to enable a filling-in of information solely in the direction where the data term gives no information, yielding an optimal complementary smoothing behaviour.
Abstract: Most variational optic flow approaches just consist of three constituents: a data term, a smoothness term and a smoothness weight. In this paper, we present an approach that harmonises these three components. We start by developing an advanced data term that is robust under outliers and varying illumination conditions. This is achieved by using constraint normalisation, and an HSV colour representation with higher order constancy assumptions and a separate robust penalisation. Our novel anisotropic smoothness is designed to work complementary to the data term. To this end, it incorporates directional information from the data constraints to enable a filling-in of information solely in the direction where the data term gives no information, yielding an optimal complementary smoothing behaviour. This strategy is applied in the spatial as well as in the spatio-temporal domain. Finally, we propose a simple method for automatically determining the optimal smoothness weight. This method is based on a novel concept that we call "optimal prediction principle" (OPP). It states that the flow field obtained with the optimal smoothness weight allows for the best prediction of the next frames in the image sequence. The benefits of our "optic flow in harmony" (OFH) approach are demonstrated by an extensive experimental validation and by a competitive performance at the widely used Middlebury optic flow benchmark.

261 citations


Journal ArticleDOI
Christopher Mei, Gabe Sibley, Mark Cummins, Paul Newman, Ian Reid
TL;DR: A relative simultaneous localisation and mapping system (RSLAM) for the constant-time estimation of structure and motion using a binocular stereo camera system as the sole sensor and a topo-metric representation in terms of a sequence of relative locations is described.
Abstract: Large scale exploration of the environment requires a constant time estimation engine. Bundle adjustment or pose relaxation do not fulfil these requirements as the number of parameters to solve grows with the size of the environment. We describe a relative simultaneous localisation and mapping system (RSLAM) for the constant-time estimation of structure and motion using a binocular stereo camera system as the sole sensor. Achieving robustness in the presence of difficult and changing lighting conditions and rapid motion requires careful engineering of the visual processing, and we describe a number of innovations which we show lead to high accuracy and robustness. In order to achieve real-time performance without placing severe limits on the size of the map that can be built, we use a topo-metric representation in terms of a sequence of relative locations. When combined with fast and reliable loop-closing, we mitigate the drift to obtain highly accurate global position estimates without any global minimisation. We discuss some of the issues that arise from using a relative representation, and evaluate our system on long sequences processed at a constant 30-45 Hz, obtaining precisions down to a few meters over distances of a few kilometres.

243 citations


Journal ArticleDOI
TL;DR: This work proposes a general variational framework for non-local image inpainting, from which important and representative previous inpainting schemes can be derived, in addition to leading to novel ones.
Abstract: Non-local methods for image denoising and inpainting have gained considerable attention in recent years. This is in part due to their superior performance in textured images, a known weakness of purely local methods. Local methods, on the other hand, have proven well suited to recovering geometric structures such as image edges. The synthesis of both types of methods is a trend in current research. Variational analysis in particular is an appropriate tool for a unified treatment of local and non-local methods. In this work we propose a general variational framework for non-local image inpainting, from which important and representative previous inpainting schemes can be derived, in addition to leading to novel ones. We explicitly study some of these, relating them to previous work and showing results on synthetic and real images.

232 citations


Journal ArticleDOI
TL;DR: By exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting.
Abstract: This paper presents a new method to estimate the relative motion of a vehicle from images of a single camera. The computational cost of the algorithm is limited only by the feature extraction and matching process, as the outlier removal and the motion estimation steps take a fraction of a millisecond on a normal laptop computer. The biggest problem in visual motion estimation is data association; matched points contain many outliers that must be detected and removed for the motion to be accurately estimated. In the last few years, a well-established method for removing outliers has been the "5-point RANSAC" algorithm, which needs a minimum of 5 point correspondences to estimate the model hypotheses. Because of this, however, it can require up to several hundred iterations to find a set of points free of outliers. In this paper, we show that by exploiting the nonholonomic constraints of wheeled vehicles it is possible to use a restrictive motion model which allows us to parameterize the motion with only 1 point correspondence. Using a single feature correspondence for motion estimation is the lowest model parameterization possible and results in the two most efficient algorithms for removing outliers: 1-point RANSAC and histogram voting. To support our method we run many experiments on both synthetic and real data and compare the performance with a state-of-the-art approach. Finally, we show an application of our method to visual odometry by recovering a 3 km trajectory in a cluttered urban environment in real time.

223 citations
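
The link between the minimal sample size and the iteration count mentioned in the abstract above follows from the standard RANSAC stopping rule, N = log(1 - p) / log(1 - (1 - e)^s), where s is the sample size, e the outlier ratio and p the required confidence. The sketch below evaluates it; the 99% confidence and 50% outlier ratio used in the note are illustrative assumptions, not figures quoted from the paper, and `ransac_iterations` is a hypothetical helper name.

```python
import math

def ransac_iterations(sample_size, outlier_ratio, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    all_inlier_prob = (1.0 - outlier_ratio) ** sample_size
    if all_inlier_prob >= 1.0:
        return 1  # no outliers: a single sample suffices
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - all_inlier_prob))
```

Under these assumed settings, `ransac_iterations(5, 0.5)` evaluates to 146 while `ransac_iterations(1, 0.5)` evaluates to 7, which shows why a 1-point parameterization is so much cheaper than a 5-point one.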


Journal ArticleDOI
TL;DR: It is shown that for a discretization based on Parseval frames the gradient descent reprojection and the alternating split Bregman algorithm are equivalent and turn out to be a frame shrinkage method and a numerical comparison with multistep methods is presented.
Abstract: We examine the underlying structure of popular algorithms for variational methods used in image processing. We focus here on operator splittings and Bregman methods based on a unified approach via fixed point iterations and averaged operators. In particular, the recently proposed alternating split Bregman method can be interpreted from different points of view: as a Bregman, as an augmented Lagrangian and as a Douglas-Rachford splitting algorithm, which is a classical operator splitting method. We also study similarities between this method and the forward-backward splitting method when applied to two frequently used models for image denoising which employ a Besov-norm and a total variation regularization term, respectively. In the first setting, we show that for a discretization based on Parseval frames the gradient descent reprojection and the alternating split Bregman algorithm are equivalent and turn out to be a frame shrinkage method. For the total variation regularizer, we also present a numerical comparison with multistep methods.

Journal ArticleDOI
TL;DR: A new interactive method for tubular structure extraction is presented, based on a variant of the minimal path method that models the vessel as a centerline and surface and builds an anisotropic metric that is well oriented along the direction of the vessel, admits higher velocity on the centerline, and provides a good estimate of the vessel radius.
Abstract: We present a new interactive method for tubular structure extraction. The main application and motivation for this work is vessel tracking in 2D and 3D images. The basic tools are minimal paths solved using the fast marching algorithm. This allows interactive tools for the physician by clicking on a small number of points in order to obtain a minimal path between two points or a set of paths in the case of a tree structure. Our method is based on a variant of the minimal path method that models the vessel as a centerline and surface. This is done by adding one dimension for the local radius around the centerline. The crucial step of our method is the definition of the local metrics to minimize. We have chosen to exploit the tubular structure of the vessels one wants to extract to build an anisotropic metric. The designed metric is well oriented along the direction of the vessel, admits higher velocity on the centerline, and provides a good estimate of the vessel radius. Based on the optimally oriented flux, this measure is required to be robust against the disturbance introduced by noise or adjacent structures with intensity similar to the target vessel. We obtain promising results on noisy synthetic and real 2D and 3D images and we present a clinical validation.

Journal ArticleDOI
TL;DR: A variational framework for the estimation of stereoscopic scene flow, i.e., the motion of points in the three-dimensional world from stereo image sequences, which partially decouples the depth estimation from the motion estimation, which has many practical advantages.
Abstract: Building upon recent developments in optical flow and stereo matching estimation, we propose a variational framework for the estimation of stereoscopic scene flow, i.e., the motion of points in the three-dimensional world from stereo image sequences. The proposed algorithm takes into account image pairs from two consecutive times and computes both depth and a 3D motion vector associated with each point in the image. In contrast to previous works, we partially decouple the depth estimation from the motion estimation, which has many practical advantages. The variational formulation is quite flexible and can handle both sparse and dense disparity maps. The proposed method is very efficient; with the depth map being computed on an FPGA, and the scene flow computed on the GPU, the proposed algorithm runs at frame rates of 20 frames per second on QVGA images (320×240 pixels). Furthermore, we present solutions to two important problems in scene flow estimation: violations of intensity consistency between input images, and the uncertainty measures for the scene flow result.

Journal ArticleDOI
TL;DR: This work improves the logDemons by integrating elasticity and incompressibility for soft-tissue tracking, and replaces the Gaussian smoothing by an efficient elastic-like regulariser based on isotropic differential quadratic forms of vector fields.
Abstract: Tracking soft tissues in medical images using non-linear image registration algorithms requires methods that are fast and provide spatial transformations consistent with the biological characteristics of the tissues. The logDemons algorithm is a fast non-linear registration method that computes diffeomorphic transformations parameterised by stationary velocity fields. Although computationally efficient, its use for tissue tracking has been limited because of its ad-hoc Gaussian regularisation, which hampers the implementation of more biologically motivated regularisations. In this work, we improve the logDemons by integrating elasticity and incompressibility for soft-tissue tracking. To that end, a mathematical justification of demons Gaussian regularisation is proposed. Building on this result, we replace the Gaussian smoothing by an efficient elastic-like regulariser based on isotropic differential quadratic forms of vector fields. The registration energy functional is finally minimised under the divergence-free constraint to get incompressible deformations. As the elastic regulariser and the constraint are linear, the method remains computationally tractable and easy to implement. Tests on synthetic incompressible deformations showed that our approach outperforms the original logDemons in terms of elastic incompressible deformation recovery without reducing the image matching accuracy. As an application, we applied the proposed algorithm to estimate 3D myocardium strain on clinical cine MRI of two adult patients. Results showed that the incompressibility constraint improves the cardiac motion recovery when compared to the ground truth provided by 3D tagged MRI.

Journal ArticleDOI
TL;DR: This paper proposes a hierarchical segmentation process, based on agglomerative merging, that re-estimates boundary strength as the segmentation progresses, and applies Gestalt grouping principles using a conditional random field (CRF) model.
Abstract: Occlusion reasoning is a fundamental problem in computer vision. In this paper, we propose an algorithm to recover the occlusion boundaries and depth ordering of free-standing structures in the scene. Rather than viewing the problem as one of pure image processing, our approach employs cues from an estimated surface layout and applies Gestalt grouping principles using a conditional random field (CRF) model. We propose a hierarchical segmentation process, based on agglomerative merging, that re-estimates boundary strength as the segmentation progresses. Our experiments on the Geometric Context dataset validate our choices for features, our iterative refinement of classifiers, and our CRF model. In experiments on the Berkeley Segmentation Dataset, PASCAL VOC 2008, and LabelMe, we also show that the trained algorithm generalizes to other datasets and can be used as an object boundary predictor with figure/ground labels.

Journal ArticleDOI
TL;DR: A smoothing method based on the log-sum exponential function is developed and indicates that such a smoothing approach leads to a novel smoothed primal-dual model and suggests labelings with maximum entropy.
Abstract: This paper is devoted to the optimization problem of continuous multi-partitioning, or multi-labeling, which is based on a convex relaxation of the continuous Potts model. In contrast to previous efforts, which tackle the optimal labeling problem in a direct manner, we first propose a novel dual model and then build up a corresponding duality-based approach. By analyzing the dual formulation, sufficient conditions are derived which show that the relaxation is often exact, i.e. there exist optimal solutions that are also globally optimal to the original nonconvex Potts model. In order to deal with the nonsmooth dual problem, we develop a smoothing method based on the log-sum exponential function and indicate that such a smoothing approach leads to a novel smoothed primal-dual model and suggests labelings with maximum entropy. Such a smoothing method for the dual model also yields a new thresholding scheme to obtain approximate solutions. An expectation maximization like algorithm is proposed based on the smoothed formulation which is shown to be superior in efficiency compared to earlier approaches from continuous optimization. Numerical experiments also show that our method outperforms several competitive approaches in various aspects, such as lower energies and better visual quality.
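
The log-sum exponential smoothing mentioned in the abstract above can be illustrated in isolation. The sketch below smooths a pointwise maximum over label scores and returns the associated soft labeling, which is the softmax distribution, i.e. the maximizer of expected score plus s times entropy. How this building block enters the paper's dual model is not stated in the abstract, so the function names and the per-pixel framing here are assumptions.

```python
import math

def smooth_max(values, s):
    """Log-sum-exp approximation of max(values) with smoothing parameter s > 0."""
    m = max(values)  # shift by the maximum for numerical stability
    return m + s * math.log(sum(math.exp((v - m) / s) for v in values))

def soft_labels(values, s):
    """Gradient of smooth_max with respect to values: a softmax over the labels."""
    m = max(values)
    weights = [math.exp((v - m) / s) for v in values]
    total = sum(weights)
    return [w / total for w in weights]
```

The approximation satisfies max(values) <= smooth_max(values, s) <= max(values) + s * log(n) for n labels, so shrinking s tightens it, and the soft labeling then concentrates on the best label, which is where a thresholding step becomes natural.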

Journal ArticleDOI
TL;DR: The Dual Hierarchical Dirichlet Processes model is extended to a Dynamic Dual-HDP model which allows dynamic update of activity models and online detection of normal/abnormal activities.
Abstract: We propose a novel framework of using a nonparametric Bayesian model, called Dual Hierarchical Dirichlet Processes (Dual-HDP) (Wang et al. in IEEE Trans. Pattern Anal. Mach. Intell. 31:539-555, 2009), for unsupervised trajectory analysis and semantic region modeling in surveillance settings. In our approach, trajectories are treated as documents and observations of an object on a trajectory are treated as words in a document. Trajectories are clustered into different activities. Abnormal trajectories are detected as samples with low likelihoods. The semantic regions, which are subsets of paths commonly taken by objects and are related to activities in the scene, are also modeled. Under Dual-HDP, both the number of activity categories and the number of semantic regions are automatically learnt from data. In this paper, we further extend Dual-HDP to a Dynamic Dual-HDP model which allows dynamic update of activity models and online detection of normal/abnormal activities. Experiments are evaluated on a simulated data set and two real data sets, which include 8,478 radar tracks collected from a maritime port and 40,453 visual tracks collected from a parking lot.

Journal ArticleDOI
TL;DR: A monocular 3D reconstruction algorithm for inextensible deformable surfaces that uses point correspondences between a single image of the deformed surface taken by a camera with known intrinsic parameters and a template to recover the 3D surface shape as seen in the image.
Abstract: We present a monocular 3D reconstruction algorithm for inextensible deformable surfaces. It uses point correspondences between a single image of the deformed surface taken by a camera with known intrinsic parameters and a template. The main assumption we make is that the surface shape as seen in the template is known. Since the surface is inextensible, its deformations are isometric to the template. We exploit the distance preservation constraints to recover the 3D surface shape as seen in the image. Though the distance preservation constraints have already been investigated in the literature, we propose a new way to handle them. Spatial smoothness priors are easily incorporated, as well as temporal smoothness priors in the case of reconstruction from a video. The reconstruction can be used for 3D augmented reality purposes thanks to a fast implementation. We report results on synthetic and real data. Some of them are compared to stereo-based 3D reconstructions to demonstrate the efficiency of our method.

Journal ArticleDOI
TL;DR: A criterion for evaluating a pair of apertures with respect to the precision of depth recovery is derived, and the two resulting coded apertures are found to complement each other in the scene frequencies they preserve.
Abstract: The classical approach to depth from defocus (DFD) uses lenses with circular apertures for image capturing. We show in this paper that the use of a circular aperture severely restricts the accuracy of DFD. We derive a criterion for evaluating a pair of apertures with respect to the precision of depth recovery. This criterion is optimized using a genetic algorithm and gradient descent search to arrive at a pair of high resolution apertures. These two coded apertures are found to complement each other in the scene frequencies they preserve. This property enables them to not only recover depth with greater fidelity but also obtain a high quality all-focused image from the two captured images. Extensive simulations as well as experiments on a variety of real scenes demonstrate the benefits of using the coded apertures over conventional circular apertures.

Journal ArticleDOI
TL;DR: This paper proposes a discriminative semi-Markov model approach, and defines a set of features over boundary frames, segments, as well as neighboring segments that enable it to conveniently capture a combination of local and global features that best represent each specific action type.
Abstract: A challenging problem in human action understanding is to jointly segment and recognize human actions from an unseen video sequence, where one person performs a sequence of continuous actions. In this paper, we propose a discriminative semi-Markov model approach, and define a set of features over boundary frames, segments, as well as neighboring segments. This enables us to conveniently capture a combination of local and global features that best represent each specific action type. To efficiently solve the inference problem of simultaneous segmentation and recognition, a Viterbi-like dynamic programming algorithm is utilized, which in practice is able to process 20 frames per second. Moreover, the model is discriminatively learned using the large-margin principle, and is formulated as an optimization problem with exponentially many constraints. To solve it efficiently, we present two different optimization algorithms, namely the cutting plane method and the bundle method, and demonstrate that each can be alternatively deployed in a "plug and play" fashion. From its theoretical aspect, we also analyze the generalization error of the proposed approach and provide a PAC-Bayes bound. The proposed approach is evaluated on a variety of datasets, and is shown to perform competitively to the state-of-the-art methods. For example, on the KTH dataset, it achieves 95.0% recognition accuracy, where the best known result on this dataset is 93.4% (Reddy and Shah in ICCV, 2009).
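
The Viterbi-like dynamic programming for joint segmentation and recognition described in the abstract above can be sketched generically. This is a simplified illustration under stated assumptions: each segment is scored independently by a user-supplied `segment_score(start, end, label)` function (a hypothetical name), segments have a maximum length `max_len`, all scores are finite, and the neighboring-segment features mentioned in the abstract are omitted.

```python
def segment_and_label(num_frames, labels, segment_score, max_len):
    """Return (best total score, list of (start, end, label)) covering frames [0, num_frames).

    segment_score(start, end, label) scores frames [start, end) as one segment.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (num_frames + 1)  # best[t]: best score for frames [0, t)
    back = [None] * (num_frames + 1)     # back[t]: (start, label) of the last segment
    best[0] = 0.0
    for end in range(1, num_frames + 1):
        for start in range(max(0, end - max_len), end):
            for label in labels:
                score = best[start] + segment_score(start, end, label)
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, label)
    # Walk the back-pointers from the last frame to recover the segmentation.
    segments = []
    t = num_frames
    while t > 0:
        start, label = back[t]
        segments.append((start, t, label))
        t = start
    segments.reverse()
    return best[num_frames], segments
```

The cost is proportional to num_frames * max_len * len(labels) score evaluations, which is what makes exact inference over all segmentations tractable in this simplified setting.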

Journal ArticleDOI
TL;DR: Experimental evaluations against state-of-the-art algorithms demonstrate the promise and effectiveness of the proposed incremental tensor subspace learning algorithm, and its applications to foreground segmentation and object tracking.
Abstract: Appearance modeling is very important for background modeling and object tracking. Subspace learning-based algorithms have been used to model the appearances of objects or scenes. Current vector subspace-based algorithms cannot effectively represent spatial correlations between pixel values. Current tensor subspace-based algorithms construct an offline representation of image ensembles, and current online tensor subspace learning algorithms cannot be applied to background modeling and object tracking. In this paper, we propose an online tensor subspace learning algorithm which models appearance changes by incrementally learning a tensor subspace representation through adaptively updating the sample mean and an eigenbasis for each unfolding matrix of the tensor. The proposed incremental tensor subspace learning algorithm is applied to foreground segmentation and object tracking for grayscale and color image sequences. The new background models capture the intrinsic spatiotemporal characteristics of scenes. The new tracking algorithm captures the appearance characteristics of an object during tracking and uses a particle filter to estimate the optimal object state. Experimental evaluations against state-of-the-art algorithms demonstrate the promise and effectiveness of the proposed incremental tensor subspace learning algorithm, and its applications to foreground segmentation and object tracking.

Journal ArticleDOI
TL;DR: A new robust approach for 3D face registration to an intrinsic coordinate system of the face that is much faster than other methods, taking only 2.5 seconds per image for registration and less than 0.1 ms per comparison.
Abstract: In this paper we present a new robust approach for 3D face registration to an intrinsic coordinate system of the face. The intrinsic coordinate system is defined by the vertical symmetry plane through the nose, the tip of the nose and the slope of the bridge of the nose. In addition, we propose a 3D face classifier based on the fusion of many dependent region classifiers for overlapping face regions. The region classifiers use PCA-LDA for feature extraction and the likelihood ratio as a matching score. Fusion is realised using straightforward majority voting for the identification scenario. For verification, a voting approach is used as well and the decision is defined by comparing the number of votes to a threshold. Using the proposed registration method combined with a classifier consisting of 60 fused region classifiers we obtain a 99.0% identification rate on the all vs first identification test of the FRGC v2 data. A verification rate of 94.6% at FAR=0.1% was obtained for the all vs all verification test on the FRGC v2 data using fusion of 120 region classifiers. The first is the highest reported performance and the second is in the top-5 of best performing systems on these tests. In addition, our approach is much faster than other methods, taking only 2.5 seconds per image for registration and less than 0.1 ms per comparison. Because we apply feature extraction using PCA and LDA, the resulting template size is also very small: 6 kB for 60 region classifiers.
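The vote-based fusion described in this abstract can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the assumption that each region classifier casts a verification vote when its likelihood-ratio score exceeds a per-classifier score threshold is ours, not a detail stated in the abstract.

```python
from collections import Counter
from typing import Hashable, Sequence


def identify_by_majority(votes: Sequence[Hashable]) -> Hashable:
    """Identification: each region classifier votes for one gallery identity;
    the identity with the most votes wins (ties go to the first one seen)."""
    return Counter(votes).most_common(1)[0][0]


def verify_by_votes(
    region_scores: Sequence[float], score_threshold: float, min_votes: int
) -> bool:
    """Verification: count region classifiers whose score exceeds
    score_threshold (assumed voting rule), then compare the vote count
    to min_votes."""
    votes = sum(1 for s in region_scores if s > score_threshold)
    return votes >= min_votes


# Hypothetical outputs of four region classifiers.
print(identify_by_majority(["A", "B", "A", "C"]))
print(verify_by_votes([2.0, 0.5, 3.0], score_threshold=1.0, min_votes=2))
```

Raising `min_votes` trades a lower false accept rate for a lower verification rate, which is how a vote threshold can be tuned to an operating point such as FAR=0.1%.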

Journal ArticleDOI
TL;DR: In this paper, a Gaussian distribution is used to model a homogeneously textured region of a natural image and the region boundary can be effectively coded by an adaptive chain code.
Abstract: We present a novel algorithm for segmentation of natural images that harnesses the principle of minimum description length (MDL). Our method is based on observations that a homogeneously textured region of a natural image can be well modeled by a Gaussian distribution and the region boundary can be effectively coded by an adaptive chain code. The optimal segmentation of an image is the one that gives the shortest coding length for encoding all textures and boundaries in the image, and is obtained via an agglomerative clustering process applied to a hierarchy of decreasing window sizes as multi-scale texture features. The optimal segmentation also provides an accurate estimate of the overall coding length and hence the true entropy of the image. We test our algorithm on the publicly available Berkeley Segmentation Dataset. It achieves state-of-the-art segmentation results compared to other existing methods.

Journal ArticleDOI
TL;DR: Experimental results show that the stochastic methodology described in this paper recognizes a wide range of group activities more reliably and accurately, as compared to previous approaches.
Abstract: This paper describes a stochastic methodology for the recognition of various types of high-level group activities. Our system maintains a probabilistic representation of a group activity, describing how individual activities of its group members must be organized temporally, spatially, and logically. In order to recognize each of the represented group activities, our system searches for a set of group members that has the maximum posterior probability of satisfying its representation. A hierarchical recognition algorithm utilizing a Markov chain Monte Carlo (MCMC)-based probability distribution sampling has been designed, detecting group activities and finding the acting groups simultaneously. The system has been tested to recognize complex activities such as 'a group of thieves stealing an object from another group' and 'a group assaulting a person'. Videos downloaded from YouTube as well as videos that we have taken are tested. Experimental results show that our system recognizes a wide range of group activities more reliably and accurately, as compared to previous approaches.

Journal ArticleDOI
TL;DR: In the proposed filter, it is shown that a suitable choice for a set of diffusivity functions to unevenly control the strength of the diffusion on the directions of the level set and gradient leads to a good edge preservation capability compared to the other diffusion or regularization filters.
Abstract: Fourth-order nonlinear diffusion filters used for image noise removal are mainly isotropic filters. This means that the spatially varying diffusivity determined by a diffusion function is applied to the image regardless of the orientation of its local features. However, the choice of numerical solver parameters that minimizes distortion of the image features results in speckle noise in the denoised image and a very slow convergence rate, especially when the noise level is moderately high. In this paper, a new fourth-order nonlinear diffusion filter is introduced, which has an anisotropic behavior on the image features. In the proposed filter, it is shown that a suitable choice for a set of diffusivity functions to unevenly control the strength of the diffusion on the directions of the level set and gradient leads to a good edge preservation capability compared to the other diffusion or regularization filters. The comparison of the results obtained by the proposed filter with those of the other second and fourth-order filters shows that the proposed method produces a noticeable improvement in the quality of denoised images evaluated subjectively and quantitatively.

Journal ArticleDOI
TL;DR: An algorithm that allows users to decide what foreground is, and then guide the output of the co-segmentation algorithm towards it via scribbles, which shows that keeping a user in the loop leads to simpler and highly parallelizable energy functions, allowing us to work with significantly more images per group.
Abstract: We present an algorithm for Interactive Co-segmentation of a foreground object from a group of related images. While previous works in co-segmentation have focussed on unsupervised co-segmentation, we use successful ideas from the interactive object-cutout literature. We develop an algorithm that allows users to decide what foreground is, and then guide the output of the co-segmentation algorithm towards it via scribbles. Interestingly, keeping a user in the loop leads to simpler and highly parallelizable energy functions, allowing us to work with significantly more images per group. However, unlike the interactive single-image counterpart, a user cannot be expected to exhaustively examine all cutouts (from tens of images) returned by the system to make corrections. Hence, we propose iCoseg, an automatic recommendation system that intelligently recommends where the user should scribble next. We introduce and make publicly available the largest co-segmentation dataset yet, the CMU-Cornell iCoseg dataset, with 38 groups, 643 images, and pixelwise hand-annotated groundtruth. Through machine experiments and real user studies with our developed interface, we show that iCoseg can intelligently recommend regions to scribble on, and users following these recommendations can achieve good quality cutouts with significantly lower time and effort than exhaustively examining all cutouts.

Journal ArticleDOI
TL;DR: A more accurate localization criterion is provided and the optimal detector is derived from it, which implies that edge detection must be performed at multiple scales to cover all the blur widths in the image.
Abstract: Canny (IEEE Trans Pattern Anal Image Proc 8(6):679-698, 1986) suggested that an optimal edge detector should maximize both signal-to-noise ratio and localization, and he derived mathematical expressions for these criteria. Based on these criteria, he claimed that the optimal step edge detector was similar to a derivative of a Gaussian. However, Canny's work suffers from two problems. First, his derivation of the localization criterion is incorrect. Here we provide a more accurate localization criterion and derive the optimal detector from it. Second, and more seriously, the Canny criteria yield an infinitely wide optimal edge detector. The width of the optimal detector can however be limited by considering the effect of the neighbouring edges in the image. If we do so, we find that the optimal step edge detector, according to the Canny criteria, is the derivative of an ISEF filter, proposed by Shen and Castan (Graph Models Image Proc 54:112-133, 1992). In addition, if we also consider detecting blurred (or non-sharp) Gaussian edges of different widths, we find that the optimal blurred-edge detector is the above optimal step edge detector convolved with a Gaussian. This implies that edge detection must be performed at multiple scales to cover all the blur widths in the image. We derive a simple scale selection procedure for edge detection, and demonstrate it in one and two dimensions.
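To make the "derivative of an ISEF filter" concrete, here is a minimal one-dimensional sketch. It assumes the commonly quoted form of the Shen-Castan infinite symmetric exponential filter, f(x) = (p/2) exp(-p|x|); the parameter `p`, the truncation radius, and the function names are illustrative choices, and the paper's corrected localization criterion and scale selection procedure are not implemented.

```python
import math
from typing import List


def isef_derivative_kernel(p: float, radius: int) -> List[float]:
    """Samples of d/dx [(p/2) exp(-p|x|)] = -sign(x) (p^2/2) exp(-p|x|),
    truncated to integer offsets in [-radius, radius] (0 at x = 0)."""
    kernel = []
    for x in range(-radius, radius + 1):
        sign = 0.0 if x == 0 else math.copysign(1.0, x)
        kernel.append(-sign * (p * p / 2.0) * math.exp(-p * abs(x)))
    return kernel


def convolve_clamped(signal: List[float], kernel: List[float]) -> List[float]:
    """Discrete convolution; out-of-range samples repeat the border value."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i - (k - r), 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


# Ideal step edge between samples 9 and 10.
step = [0.0] * 10 + [1.0] * 10
response = convolve_clamped(step, isef_derivative_kernel(p=0.7, radius=8))
edge = max(range(len(response)), key=lambda i: response[i])
print(edge)
```

Convolving a step with the derivative of a smoothing filter reproduces the smoothing filter centered on the edge, so the response peaks at the two samples adjacent to the step. A smaller `p` gives a wider kernel, which is the sense in which several scales are needed to cover different blur widths.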

Journal ArticleDOI
TL;DR: Left-invariant diffusion on the group of 3D rigid body movements SE(3) and its application to crossing-preserving smoothing of HARDI images is studied.
Abstract: HARDI (High Angular Resolution Diffusion Imaging) is a recent magnetic resonance imaging (MRI) technique for imaging water diffusion processes in fibrous tissues such as brain white matter and muscles. In this article we study left-invariant diffusion on the group of 3D rigid body movements (i.e. 3D Euclidean motion group) SE(3) and its application to crossing-preserving smoothing of HARDI images. The linear left-invariant (convection-)diffusions are forward Kolmogorov equations of Brownian motions on the space of positions and orientations in 3D embedded in SE(3) and can be solved by ℝ³ ⋊ S²-convolution with the corresponding Green's functions. We provide analytic approximation formulas and explicit sharp Gaussian estimates for these Green's functions. In our design and analysis for appropriate (nonlinear) convection-diffusions on HARDI data we explain the underlying differential geometry on SE(3). We write our left-invariant diffusions in covariant derivatives on SE(3) using the Cartan connection. This Cartan connection has constant curvature and constant torsion, and so have the exponential curves which are the auto-parallels along which our left-invariant diffusion takes place. We provide experiments of our crossing-preserving Euclidean-invariant diffusions on artificial HARDI data containing crossing-fibers.

Journal ArticleDOI
TL;DR: This paper proposes a technique which is able to efficiently compute a high-quality scene representation via graph-cut optimisation of an energy function combining multiple image cues with strong priors in a view-dependent manner with respect to each input camera.
Abstract: Current state-of-the-art image-based scene reconstruction techniques are capable of generating high-fidelity 3D models when used under controlled capture conditions. However, they are often inadequate when used in more challenging environments such as sports scenes with moving cameras. Algorithms must be able to cope with relatively large calibration and segmentation errors as well as input images separated by a wide-baseline and possibly captured at different resolutions. In this paper, we propose a technique which, under these challenging conditions, is able to efficiently compute a high-quality scene representation via graph-cut optimisation of an energy function combining multiple image cues with strong priors. Robustness is achieved by jointly optimising scene segmentation and multiple view reconstruction in a view-dependent manner with respect to each input camera. Joint optimisation prevents propagation of errors from segmentation to reconstruction as is often the case with sequential approaches. View-dependent processing increases tolerance to errors in through-the-lens calibration compared to global approaches. We evaluate our technique in the case of challenging outdoor sports scenes captured with manually operated broadcast cameras as well as several indoor scenes with natural background. A comprehensive experimental evaluation including qualitative and quantitative results demonstrates the accuracy of the technique for high quality segmentation and reconstruction and its suitability for free-viewpoint video under these difficult conditions.

Journal ArticleDOI
TL;DR: A function for predicting the importance of each object directly from a segmented image is fit and it is found that object position and size are particularly informative, while a popular measure of saliency is not.
Abstract: How important is a particular object in a photograph of a complex scene? We propose a definition of importance and present two methods for measuring object importance from human observers. Using this ground truth, we fit a function for predicting the importance of each object directly from a segmented image; our function combines a large number of object-related and image-related features. We validate our importance predictions on 2,841 objects and find that the most important objects may be identified automatically. We find that object position and size are particularly informative, while a popular measure of saliency is not.