
Showing papers in "International Journal of Computer Vision in 2013"


Journal ArticleDOI
TL;DR: This paper introduces selective search which combines the strength of both an exhaustive search and segmentation, and shows that its selective search enables the use of the powerful Bag-of-Words model for recognition.
Abstract: This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99 % recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html ).
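
The greedy hierarchical grouping at the heart of selective search can be sketched in a few lines. The following is an illustrative reading of the abstract, not the released software: it assumes regions are frozensets of pixel indices, `neighbours` lists the adjacent region pairs from the initial segmentation, and `similarity(a, b)` combines the colour, texture, size and fill cues used in the paper.

```python
def hierarchical_grouping(initial_regions, similarity, neighbours):
    """Repeatedly merge the most similar adjacent regions; every region
    ever created is kept as a candidate object location."""
    candidates = list(initial_regions)
    sims = {(a, b): similarity(a, b) for a, b in neighbours}
    while sims:
        a, b = max(sims, key=sims.get)   # most similar neighbouring pair
        merged = a | b                   # union of the two pixel sets
        candidates.append(merged)
        # neighbours of a or b (other than a and b) become neighbours of `merged`
        touched = {r for pair in sims if a in pair or b in pair for r in pair} - {a, b}
        sims = {p: s for p, s in sims.items() if a not in p and b not in p}
        for r in touched:
            sims[(merged, r)] = similarity(merged, r)
    return candidates
```

Diversification then amounts to running this loop under several complementary colour spaces and similarity mixtures, and pooling the resulting candidates.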

5,843 citations


Journal ArticleDOI
TL;DR: The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion.
Abstract: This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees a good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH) which rely on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
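
To make the MBH idea concrete, here is a minimal numpy sketch of a motion-boundary histogram for a single patch of a dense flow field. The trajectory alignment and spatio-temporal grid of the paper are omitted, and the bin count is an arbitrary choice, so treat this as an illustration rather than the paper's descriptor.

```python
import numpy as np

def mbh_patch(flow, bins=8):
    """flow: (H, W, 2) dense optical flow for one patch."""
    descriptor = []
    for c in range(2):                      # MBHx from u, MBHy from v
        gy, gx = np.gradient(flow[..., c])  # differential optical flow
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
        descriptor.append(hist / (hist.sum() + 1e-9))   # L1 normalisation
    return np.concatenate(descriptor)       # 2 * bins values per patch
```

Because constant (camera-induced) motion has zero spatial derivative, these gradient-of-flow histograms are largely insensitive to camera motion, which is the intuition behind MBH's robustness.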

1,726 citations


Journal ArticleDOI
TL;DR: This work proposes to use the Fisher Kernel framework as an alternative patch encoding strategy: it describes patches by their deviation from a “universal” generative Gaussian mixture model, and reports experimental results showing that the FV framework is a state-of-the-art patch encoding technique.
Abstract: A standard approach to describe an image for classification and retrieval purposes is to extract a set of local patch descriptors, encode them into a high dimensional vector and pool them into an image-level signature. The most common patch encoding strategy consists in quantizing the local descriptors into a finite set of prototypical elements. This leads to the popular Bag-of-Visual words representation. In this work, we propose to use the Fisher Kernel framework as an alternative patch encoding strategy: we describe patches by their deviation from a "universal" generative Gaussian mixture model. This representation, which we call the Fisher vector, has many advantages: it is efficient to compute, it leads to excellent results even with efficient linear classifiers, and it can be compressed with a minimal loss of accuracy using product quantization. We report experimental results on five standard datasets--PASCAL VOC 2007, Caltech 256, SUN 397, ILSVRC 2010 and ImageNet10K--with up to 9M images and 10K classes, showing that the FV framework is a state-of-the-art patch encoding technique.
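
A minimal sketch of the encoding, restricted to the gradients with respect to the GMM means; the full Fisher vector adds weight and variance terms plus power and L2 normalisation. scikit-learn's GaussianMixture is our stand-in for the "universal" generative model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector_means(X, gmm):
    """X: (N, D) local descriptors; gmm: fitted GMM with covariance_type='diag'."""
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                        # (N, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    # G_k = 1 / (N * sqrt(w_k)) * sum_n gamma_nk * (x_n - mu_k) / sigma_k
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    G = (gamma[..., None] * diff).sum(axis=0) / (N * np.sqrt(w)[:, None])
    return G.ravel()                                    # (K * D,) signature

# e.g. gmm = GaussianMixture(64, covariance_type='diag').fit(train_descriptors)
```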

1,594 citations


Journal ArticleDOI
TL;DR: It is argued that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols, and an inherent trade-off exists between the mix of human and cloud computing used vs. the accuracy and cost of the labeling.
Abstract: We present an extensive three year study on economically annotating video with crowdsourced marketplaces. Our public framework has annotated thousands of real world videos, including massive data sets unprecedented for their size, complexity, and cost. To accomplish this, we designed a state-of-the-art video annotation user interface and demonstrate that, despite common intuition, many contemporary interfaces are sub-optimal. We present several user studies that evaluate different aspects of our system and demonstrate that minimizing the cognitive load of the user is crucial when designing an annotation platform. We then deploy this interface on Amazon Mechanical Turk and discover expert and talented workers who are capable of annotating difficult videos with dense and closely cropped labels. We argue that video annotation requires specialized skill; most workers are poor annotators, mandating robust quality control protocols. We show that traditional crowdsourced micro-tasks are not suitable for video annotation and instead demonstrate that deploying time-consuming macro-tasks on MTurk is effective. Finally, we show that by extracting pixel-based features from manually labeled key frames, we are able to leverage more sophisticated interpolation strategies to maximize performance given a fixed budget. We validate the power of our framework on difficult, real-world data sets and we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling. We further introduce a novel, cost-based evaluation criterion that compares vision algorithms by the budget required to achieve an acceptable performance. We hope our findings will spur innovation in the creation of massive labeled video data sets and enable novel data-driven computer vision applications.
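
For intuition about the interpolation baseline the paper improves on with pixel-based features: the cheapest strategy is plain linear interpolation of bounding boxes between two human-labelled key frames. The (x, y, w, h) box format below is a hypothetical choice.

```python
def interpolate_boxes(box0, t0, box1, t1):
    """Linearly interpolate (x, y, w, h) boxes between key frames t0 < t1."""
    boxes = {}
    for t in range(t0, t1 + 1):
        a = (t - t0) / float(t1 - t0)
        boxes[t] = tuple((1 - a) * c0 + a * c1 for c0, c1 in zip(box0, box1))
    return boxes
```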

529 citations


Journal ArticleDOI
TL;DR: A random forest-based framework for real time head pose estimation from depth images is presented and extended to localize a set of facial features in 3D, achieving real time performance without resorting to parallel computations on a GPU.
Abstract: We present a random forest-based framework for real time head pose estimation from depth images and extend it to localize a set of facial features in 3D. Our algorithm takes a voting approach, where each patch extracted from the depth image can directly cast a vote for the head pose or each of the facial features. Our system proves capable of handling large rotations, partial occlusions, and the noisy depth data acquired using commercial sensors. Moreover, the algorithm works on each frame independently and achieves real time performance without resorting to parallel computations on a GPU. We present extensive experiments on publicly available, challenging datasets and present a new annotated head pose database recorded using a Microsoft Kinect.

504 citations


Journal ArticleDOI
TL;DR: Rotational Projection Statistics (RoPS) as discussed by the authors is a feature descriptor that is obtained by rotationally projecting the neighboring points of a feature point onto 2D planes and calculating a set of statistics including low-order central moments and entropy of the distribution of these projected points.
Abstract: Recognizing 3D objects in the presence of noise, varying mesh resolution, occlusion and clutter is a very challenging task. This paper presents a novel method named Rotational Projection Statistics (RoPS). It has three major modules: local reference frame (LRF) definition, RoPS feature description and 3D object recognition. We propose a novel technique to define the LRF by calculating the scatter matrix of all points lying on the local surface. RoPS feature descriptors are obtained by rotationally projecting the neighboring points of a feature point onto 2D planes and calculating a set of statistics (including low-order central moments and entropy) of the distribution of these projected points. Using the proposed LRF and RoPS descriptor, we present a hierarchical 3D object recognition algorithm. The performance of the proposed LRF, RoPS descriptor and object recognition algorithm was rigorously tested on a number of popular and publicly available datasets. Our proposed techniques exhibited superior performance compared to existing techniques. We also showed that our method is robust with respect to noise and varying mesh resolution. Our RoPS based algorithm achieved recognition rates of 100 %, 98.9 %, 95.4 % and 96.0 % on the Bologna, UWA, Queen’s and Ca’ Foscari Venezia datasets, respectively.
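
A sketch of one rotation step of the RoPS construction, assuming `points` holds a feature point's neighbours already expressed in the proposed LRF. The bin count, rotation angles and exact moment set are illustrative stand-ins for the paper's choices.

```python
import numpy as np

def rops_statistics(points, angle, bins=5):
    """Rotate the local neighbourhood about z, project onto the three
    coordinate planes and summarise each projected distribution."""
    c, s = np.cos(angle), np.sin(angle)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    p = points @ Rz.T
    stats = []
    for i, j in [(0, 1), (0, 2), (1, 2)]:               # xy, xz, yz planes
        D, _, _ = np.histogram2d(p[:, i], p[:, j], bins=bins)
        D /= D.sum() + 1e-12                            # distribution of projected points
        u, v = np.meshgrid(np.arange(bins), np.arange(bins), indexing='ij')
        ub, vb = (D * u).sum(), (D * v).sum()
        m11 = (D * (u - ub) * (v - vb)).sum()           # low-order central moments
        m22 = (D * (u - ub) ** 2 * (v - vb) ** 2).sum()
        entropy = -(D[D > 0] * np.log(D[D > 0])).sum()  # Shannon entropy
        stats += [m11, m22, entropy]
    return np.array(stats)  # the full descriptor concatenates several rotations
```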

437 citations


Journal ArticleDOI
TL;DR: The results demonstrate that mining the interdependencies between particles improves tracking performance and reduces overall computational complexity, and both methods consistently outperform state-of-the-art trackers.
Abstract: In this paper, we formulate object tracking in a particle filter framework as a structured multi-task sparse learning problem, which we denote as Structured Multi-Task Tracking (S-MTT). Since we model particles as linear combinations of dictionary templates that are updated dynamically, learning the representation of each particle is considered a single task in Multi-Task Tracking (MTT). By employing popular sparsity-inducing $\ell_{p,q}$ mixed norms (specifically $p \in \{2, \infty\}$ and $q = 1$), we regularize the representation problem to enforce joint sparsity and learn the particle representations together. As compared to previous methods that handle particles independently, our results demonstrate that mining the interdependencies between particles improves tracking performance and reduces overall computational complexity. Interestingly, we show that the popular $L_1$ tracker (Mei and Ling, IEEE Trans Pattern Anal Mach Intell 33(11):2259–2272, 2011) is a special case of our MTT formulation (denoted as the $L_{11}$ tracker) when $p = q = 1$. Under the MTT framework, some of the tasks (particle representations) are often more closely related and more likely to share common relevant covariates than other tasks. Therefore, we extend the MTT framework to take into account pairwise structural correlations between particles (e.g. spatial smoothness of representation) and denote the novel framework as S-MTT. The problem of learning the regularized sparse representation in MTT and S-MTT can be solved efficiently using an Accelerated Proximal Gradient (APG) method that yields a sequence of closed form updates. As such, S-MTT and MTT are computationally attractive. We test our proposed approach on challenging sequences involving heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that S-MTT is much better than MTT, and both methods consistently outperform state-of-the-art trackers.
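
The closed-form updates that make APG efficient hinge on the proximal operator of the mixed norm. For $p = 2$, $q = 1$ this is the standard row-wise shrinkage enforcing joint sparsity, sketched below (our transcription of a textbook operator, not the authors' code).

```python
import numpy as np

def prox_l21(C, lam):
    """argmin_X 0.5 * ||X - C||_F^2 + lam * sum_i ||X[i, :]||_2
    Rows with norm below lam are zeroed; the rest are shrunk jointly."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return scale * C
```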

434 citations


Journal ArticleDOI
TL;DR: A categorization of existing methods into two classes that highlights their common traits is proposed, abstracting all algorithms into two general structures, together with a comprehensive experimental evaluation in terms of repeatability, distinctiveness and computational efficiency.
Abstract: In the past few years detection of repeatable and distinctive keypoints on 3D surfaces has been the focus of intense research activity, due on the one hand to the increasing diffusion of low-cost 3D sensors, on the other to the growing importance of applications such as 3D shape retrieval and 3D object recognition. This work aims at contributing to the maturity of this field by a thorough evaluation of several recent 3D keypoint detectors. A categorization of existing methods into two classes, which allows for highlighting their common traits, is proposed, so as to abstract all algorithms into two general structures. Moreover, a comprehensive experimental evaluation is carried out in terms of repeatability, distinctiveness and computational efficiency, based on a vast data corpus characterized by nuisances such as noise, clutter, occlusions and viewpoint changes.
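
For reference, repeatability in such evaluations is commonly measured by mapping each model keypoint into the scene with the ground-truth pose and counting how many land within a tolerance of a detected scene keypoint. A minimal sketch (the tolerance `eps` and the exact counting protocol vary across papers):

```python
import numpy as np
from scipy.spatial import cKDTree

def repeatability(model_kp, scene_kp, R, t, eps):
    """model_kp: (N, 3), scene_kp: (M, 3), (R, t): ground-truth rigid pose."""
    mapped = model_kp @ R.T + t
    dist, _ = cKDTree(scene_kp).query(mapped)
    return float((dist < eps).mean())
```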

361 citations


Journal ArticleDOI
TL;DR: A simple and effective nonparametric approach to image parsing that works by scene-level matching with global image descriptors, followed by superpixel-level matching with local features and efficient Markov Random Field (MRF) optimization for incorporating neighborhood context.
Abstract: This paper presents a simple and effective nonparametric approach to the problem of image parsing, or labeling image regions (in our case, superpixels produced by bottom-up segmentation) with their categories. This approach requires no training, and it can easily scale to datasets with tens of thousands of images and hundreds of labels. It works by scene-level matching with global image descriptors, followed by superpixel-level matching with local features and efficient Markov random field (MRF) optimization for incorporating neighborhood context. Our MRF setup can also compute a simultaneous labeling of image regions into semantic classes (e.g., tree, building, car) and geometric classes (sky, vertical, ground). Our system outperforms the state-of-the-art non-parametric method based on SIFT Flow on a dataset of 2,688 images and 33 labels. In addition, we report per-pixel rates on a larger dataset of 15,150 images and 170 labels. To our knowledge, this is the first complete evaluation of image parsing on a dataset of this size, and it establishes a new benchmark for the problem.

349 citations


Journal ArticleDOI
TL;DR: A latency-aware learning formulation is used to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses.
Abstract: An important aspect in designing interactive, action-based interfaces is reliably recognizing actions with minimal latency. High latency causes the system's feedback to lag behind user actions and thus significantly degrades the interactivity of the user experience. This paper presents algorithms for reducing latency when recognizing actions. We use a latency-aware learning formulation to train a logistic regression-based classifier that automatically determines distinctive canonical poses from data and uses these to robustly recognize actions in the presence of ambiguous poses. We introduce a novel (publicly released) dataset for the purpose of our experiments. Comparisons of our method against both a Bag of Words and a Conditional Random Field (CRF) classifier show improved recognition performance for both pre-segmented and online classification tasks. Additionally, we employ GentleBoost to reduce our feature set and further improve our results. We then present experiments that explore the accuracy/latency trade-off over a varying number of actions. Finally, we evaluate our algorithm on two existing datasets.

262 citations


Journal ArticleDOI
TL;DR: The theory of geodesic regression and least-squares estimation on Riemannian manifolds is developed; while the method can be generally applied to data on any manifold, specific examples are given for a set of synthetically generated rotation data and an application to analyzing shape changes in the corpus callosum due to age.
Abstract: This paper develops the theory of geodesic regression and least-squares estimation on Riemannian manifolds. Geodesic regression is a method for finding the relationship between a real-valued independent variable and a manifold-valued dependent random variable, where this relationship is modeled as a geodesic curve on the manifold. Least-squares estimation is formulated intrinsically as a minimization of the sum-of-squared geodesic distances of the data to the estimated model. Geodesic regression is a direct generalization of linear regression to the manifold setting, and it provides a simple parameterization of the estimated relationship as an initial point and velocity, analogous to the intercept and slope. A nonparametric permutation test for determining the significance of the trend is also given. For the case of symmetric spaces, two main theoretical results are established. First, conditions for existence and uniqueness of the least-squares problem are provided. Second, a maximum likelihood criterion is developed for a suitable definition of Gaussian errors on the manifold. While the method can be generally applied to data on any manifold, specific examples are given for a set of synthetically generated rotation data and an application to analyzing shape changes in the corpus callosum due to age.
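
In symbols (our notation, following the abstract), with $\operatorname{Exp}$ the Riemannian exponential map and $d$ the geodesic distance, the model and its intrinsic least-squares estimator read:

```latex
% intercept p \in M, slope v \in T_p M, error \epsilon in the tangent space
\begin{align}
  y &= \operatorname{Exp}\bigl(\operatorname{Exp}(p, x\,v),\ \epsilon\bigr), \\
  (\hat{p}, \hat{v}) &= \operatorname*{arg\,min}_{(p,\,v)}\
      \frac{1}{2}\sum_{i=1}^{N} d\bigl(\operatorname{Exp}(p, x_i v),\, y_i\bigr)^{2},
\end{align}
```

so $(\hat{p}, \hat{v})$ plays exactly the role of intercept and slope in ordinary linear regression, which it reduces to when $M = \mathbb{R}^n$.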

Journal ArticleDOI
TL;DR: A system for the detection, segmentation and recognition of multi-class hand postures against complex natural backgrounds using a Bayesian model of visual attention to generate a saliency map, and to detect and identify the hand region.
Abstract: A system for the detection, segmentation and recognition of multi-class hand postures against complex natural backgrounds is presented. Visual attention, which is the cognitive process of selectively concentrating on a region of interest in the visual field, helps humans recognize objects in cluttered natural scenes. The proposed system utilizes a Bayesian model of visual attention to generate a saliency map, and to detect and identify the hand region. Feature based visual attention is implemented using a combination of high level (shape, texture) and low level (color) image features. The shape and texture features are extracted from a skin similarity map, using a computational model of the ventral stream of visual cortex. The skin similarity map, which represents the similarity of each pixel to the human skin color in HSI color space, enhances the edges and shapes within the skin colored regions. The color features used are the discretized chrominance components in the HSI and YCbCr color spaces, and the similarity to skin map. The hand postures are classified using the shape and texture features, with a support vector machines classifier. A new 10-class complex-background hand posture dataset, namely the NUS hand posture dataset-II, is developed for testing the proposed algorithm (40 subjects, different ethnicities, various hand sizes, 2750 hand postures and 2000 background images). The algorithm is tested for hand detection and hand posture recognition using 10-fold cross-validation. The experimental results show that the algorithm has a person independent performance, and is reliable against variations in hand sizes and complex backgrounds. The algorithm provided a recognition rate of 94.36 %. A comparison of the proposed algorithm with other existing methods demonstrates its better performance.

Journal ArticleDOI
TL;DR: A system for recognizing material categories from single images based on studies of human material recognition is presented, and it is suggested that future progress in material recognition will come from a deeper understanding of the role of non-local surface properties and efforts to model such non- local surface properties in images.
Abstract: Our world consists not only of objects and scenes but also of materials of various kinds. Being able to recognize the materials that surround us (e.g., plastic, glass, concrete) is important for humans as well as for computer vision systems. Unfortunately, materials have received little attention in the visual recognition literature, and very few computer vision systems have been designed specifically to recognize materials. In this paper, we present a system for recognizing material categories from single images. We propose a set of low and mid-level image features that are based on studies of human material recognition, and we combine these features using an SVM classifier. Our system outperforms a state-of-the-art system (Varma and Zisserman, TPAMI 31(11):2032–2047, 2009) on a challenging database of real-world material categories (Sharan et al., J Vis 9(8):784–784a, 2009). When the performance of our system is compared directly to that of human observers, humans outperform our system quite easily. However, when we account for the local nature of our image features and the surface properties they measure (e.g., color, texture, local shape), our system rivals human performance. We suggest that future progress in material recognition will come from: (1) a deeper understanding of the role of non-local surface properties (e.g., extended highlights, object identity); and (2) efforts to model such non-local surface properties in images.

Journal ArticleDOI
TL;DR: A new transportation-related distance between pairs of images, denoted linear optimal transportation (LOT), is described; it can be used directly on pixel intensities, and is based on a linearized version of the Kantorovich-Wasserstein metric.
Abstract: Transportation-based metrics for comparing images have long been applied to analyze images, especially where one can interpret the pixel intensities (or derived quantities) as a distribution of ‘mass’ that can be transported without strict geometric constraints. Here we describe a new transportation-based framework for analyzing sets of images. More specifically, we describe a new transportation-related distance between pairs of images, which we denote as linear optimal transportation (LOT). The LOT can be used directly on pixel intensities, and is based on a linearized version of the Kantorovich-Wasserstein metric (an optimal transportation distance, as is the earth mover's distance). The new framework is especially well suited for computing all pairwise distances for a large database of images efficiently, and thus it can be used for pattern recognition in sets of images. In addition, the new LOT framework also allows for an isometric linear embedding, greatly facilitating the ability to visualize discriminant information in different classes of images. We demonstrate the application of the framework to several tasks such as discriminating nuclear chromatin patterns in cancer cells, decoding differences in facial expressions, galaxy morphologies, as well as subcellular protein distributions.
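
One way to write the linearized distance the abstract describes (our notation, and an assumed discretization): fix a reference measure $\sigma = \sum_k w_k \delta_{x_k}$ and let $f_i$ denote the optimal transport map carrying $\sigma$ onto image $I_i$; each image then embeds linearly through the displacements $f_i(x_k)$, and

```latex
\begin{equation}
  d_{\mathrm{LOT}}(I_i, I_j)^2 \;=\; \sum_k w_k \,\bigl\| f_i(x_k) - f_j(x_k) \bigr\|^2 ,
\end{equation}
```

so all pairwise distances in a database reduce to Euclidean distances between embeddings, computed from one transport problem per image rather than one per pair.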

Journal ArticleDOI
TL;DR: This paper exploits the high correlation between 2D trajectories of different points on the same non-rigid surface by assuming that the displacement of any point throughout the sequence can be expressed in a compact way as a linear combination of a low-rank motion basis.
Abstract: This paper addresses the problem of non-rigid video registration, or the computation of optical flow from a reference frame to each of the subsequent images in a sequence, when the camera views deformable objects. We exploit the high correlation between 2D trajectories of different points on the same non-rigid surface by assuming that the displacement of any point throughout the sequence can be expressed in a compact way as a linear combination of a low-rank motion basis. This subspace constraint effectively acts as a trajectory regularization term leading to temporally consistent optical flow. We formulate it as a robust soft constraint within a variational framework by penalizing flow fields that lie outside the low-rank manifold. The resulting energy functional can be decoupled into the optimization of the brightness constancy and spatial regularization terms, leading to an efficient optimization scheme. Additionally, we propose a novel optimization scheme for the case of vector valued images, based on the dualization of the data term. This allows us to extend our approach to deal with colour images which results in significant improvements on the registration results. Finally, we provide a new benchmark dataset, based on motion capture data of a flag waving in the wind, with dense ground truth optical flow for evaluation of multi-frame optical flow algorithms for non-rigid surfaces. Our experiments show that our proposed approach outperforms state of the art optical flow and dense non-rigid registration algorithms.

Journal ArticleDOI
TL;DR: The main contribution of this work is the fusion of a 3D representation and an advanced variational framework that directly uses the available multi-view information to advantageously bind the 3D unknowns in time and space.
Abstract: We present a novel method for recovering the 3D structure and scene flow from calibrated multi-view sequences. We propose a 3D point cloud parametrization of the 3D structure and scene flow that allows us to directly estimate the desired unknowns. A unified global energy functional is proposed to incorporate the information from the available sequences and simultaneously recover both depth and scene flow. The functional enforces multi-view geometric consistency and imposes brightness constancy and piecewise smoothness assumptions directly on the 3D unknowns. It inherently handles the challenges of discontinuities, occlusions, and large displacements. The main contribution of this work is the fusion of a 3D representation and an advanced variational framework that directly uses the available multi-view information. This formulation allows us to advantageously bind the 3D unknowns in time and space. Different from optical flow and disparity, the proposed method results in a nonlinear mapping between the images' coordinates, thus giving rise to additional challenges in the optimization process. Our experiments on real and synthetic data demonstrate that the proposed method successfully recovers the 3D structure and scene flow despite the complicated nonconvex optimization problem.

Journal ArticleDOI
Peng Wang1, Gang Zeng1, Rui Gan1, Jingdong Wang2, Hongbin Zha1 
TL;DR: This paper describes a structure-sensitive superpixel technique that exploits Lloyd’s algorithm with the geodesic distance: it generates smaller superpixels to achieve relatively low under-segmentation in structure-dense regions with high intensity or color variation, and produces larger segments to increase computational efficiency in structure-sparse regions with homogeneous appearance.
Abstract: Segmenting images into superpixels as supporting regions for feature vectors and primitives to reduce computational complexity has been commonly used as a fundamental step in various image analysis and computer vision tasks. In this paper, we describe the structure-sensitive superpixel technique by exploiting Lloyd’s algorithm with the geodesic distance. Our method generates smaller superpixels to achieve relatively low under-segmentation in structure-dense regions with high intensity or color variation, and produces larger segments to increase computational efficiency in structure-sparse regions with homogeneous appearance. We adopt geometric flows to compute geodesic distances amongst pixels. In the segmentation procedure, the density of over-segments is automatically adjusted through iteratively optimizing an energy functional that embeds color homogeneity and structure density. Comparative experiments with the Berkeley database show that the proposed algorithm outperforms prior art while offering computational efficiency comparable to TurboPixels. Further applications in image compression, object closure extraction and video segmentation demonstrate the effective extensions of our approach.

Journal ArticleDOI
TL;DR: A comprehensive survey of geometric and topological coverage models for camera networks from the literature is presented and the properties of a hypothetical inclusively general model of each type are derived.
Abstract: Modeling the coverage of a sensor network is an important step in a number of design and optimization techniques. The nature of vision sensors presents unique challenges in deriving such models for camera networks. A comprehensive survey of geometric and topological coverage models for camera networks from the literature is presented. The models are analyzed and compared in the context of their intended applications, and from this treatment the properties of a hypothetical inclusively general model of each type are derived.

Journal ArticleDOI
TL;DR: This work addresses the problem of automatically detecting a sparse set of 3D mesh vertices, likely to be good candidates for determining correspondences, even on soft organic objects, with a machine-learning approach that achieves state-of-the-art performance while being highly generic.
Abstract: We address the problem of automatically detecting a sparse set of 3D mesh vertices, likely to be good candidates for determining correspondences, even on soft organic objects. We focus on 3D face scans, on which single local shape descriptor responses are known to be weak, sparse or noisy. Our machine-learning approach consists of computing feature vectors containing $D$ different local surface descriptors. These vectors are normalized with respect to the learned distribution of those descriptors for some given target shape (landmark) of interest. Then, an optimal function of this vector is extracted that best separates this particular target shape from its surrounding region within the set of training data. We investigate two alternatives for this optimal function: a linear method, namely Linear Discriminant Analysis, and a non-linear method, namely AdaBoost. We evaluate our approach by landmarking 3D face scans in the FRGC v2 and Bosphorus 3D face datasets. Our system achieves state-of-the-art performance while being highly generic.

Journal ArticleDOI
TL;DR: The experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors and that late fusion of color and shape information outperforms other approaches on action recognition.
Abstract: In this article we investigate the problem of human action recognition in static images. By action recognition we mean a class of problems which includes both action classification and action detection (i.e. simultaneous localization and classification). Bag-of-words image representations yield promising results for action classification, and deformable part models perform very well for object detection. The representations for action recognition typically use only shape cues and ignore color information. Inspired by the recent success of color in image classification and object detection, we investigate the potential of color for action classification and detection in static images. We perform a comprehensive evaluation of color descriptors and fusion approaches for action recognition. Experiments were conducted on the three datasets most used for benchmarking action recognition in still images: Willow, PASCAL VOC 2010 and Stanford-40. Our experiments demonstrate that incorporating color information considerably improves recognition performance, and that a descriptor based on color names outperforms pure color descriptors. Our experiments demonstrate that late fusion of color and shape information outperforms other approaches on action recognition. Finally, we show that the different color-shape fusion approaches result in complementary information and combining them yields state-of-the-art performance for action classification.
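
A minimal sketch of late fusion with linear SVMs as stand-in classifiers; the paper compares several fusion schemes, and the equal weighting below is an assumption.

```python
import numpy as np
from sklearn.svm import LinearSVC

def late_fusion_predict(color_tr, shape_tr, y, color_te, shape_te, w=0.5):
    """Train separate classifiers on colour and shape features, then
    combine their per-class decision scores (multi-class case)."""
    col = LinearSVC().fit(color_tr, y)
    shp = LinearSVC().fit(shape_tr, y)
    scores = (w * col.decision_function(color_te)
              + (1 - w) * shp.decision_function(shape_te))
    return col.classes_[scores.argmax(axis=1)]
```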

Journal ArticleDOI
TL;DR: This paper designs binary structured light patterns that are resilient to individual indirect illumination effects using simple logical operations and tools from combinatorial mathematics, and presents a practical 3D scanning system which works in the presence of a broad range of indirect illumination.
Abstract: Global or indirect illumination effects such as interreflections and subsurface scattering severely degrade the performance of structured light-based 3D scanning. In this paper, we analyze the errors in structured light, caused by both long-range (interreflections) and short-range (subsurface scattering) indirect illumination. The errors depend on the frequency of the projected patterns, and the nature of indirect illumination. In particular, we show that long-range effects cause decoding errors for low-frequency patterns, whereas short-range effects affect high-frequency patterns. Based on this analysis, we present a practical 3D scanning system which works in the presence of a broad range of indirect illumination. First, we design binary structured light patterns that are resilient to individual indirect illumination effects using simple logical operations and tools from combinatorial mathematics. Scenes exhibiting multiple phenomena are handled by combining results from a small ensemble of such patterns. This combination also allows detecting any residual errors that are corrected by acquiring a few additional images. Our methods can be readily incorporated into existing scanning systems without significant overhead in terms of capture time or hardware. We show results for several scenes with complex shape and material properties.
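
A sketch of the logical-code idea as we read it from the abstract: XOR-ing each conventional Gray-code pattern with a high-frequency base pattern makes every projected pattern high-frequency (hence resilient to long-range interreflections), and XOR-ing the decoded bits with the base bits inverts the transform. Taking the last (finest) bit plane as the base is our assumption for illustration.

```python
import numpy as np

def gray_code_patterns(width, n_bits):
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)          # binary-reflected Gray code per column
    return np.array([(gray >> b) & 1 for b in range(n_bits - 1, -1, -1)])

def to_xor_codes(patterns):
    base = patterns[-1]                # highest-frequency bit plane
    return np.vstack([patterns[:-1] ^ base, base[None, :]])  # all high-frequency

def decode_xor(captured_bits):
    base = captured_bits[-1]
    return np.vstack([captured_bits[:-1] ^ base, base[None, :]])  # Gray bits back
```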

Journal ArticleDOI
TL;DR: Group differences may be better characterized by a different speed of maturation rather than shape differences at a given age, and this method is applied to analyze the differences in the growth of the hippocampus in children diagnosed with autism, developmental delays and in controls.
Abstract: This paper proposes an original approach for the statistical analysis of longitudinal shape data. The proposed method allows the characterization of typical growth patterns and subject-specific shape changes in repeated time-series observations of several subjects. This can be seen as the extension of usual longitudinal statistics of scalar measurements to high-dimensional shape or image data. The method is based on the estimation of continuous subject-specific growth trajectories and the comparison of such temporal shape changes across subjects. Differences between growth trajectories are decomposed into morphological deformations, which account for shape changes independent of the time, and time warps, which account for different rates of shape changes over time. Given a longitudinal shape data set, we estimate a mean growth scenario representative of the population, and the variations of this scenario both in terms of shape changes and in terms of change in growth speed. Then, intrinsic statistics are derived in the space of spatiotemporal deformations, which characterize the typical variations in shape and in growth speed within the studied population. They can be used to detect systematic developmental delays across subjects. In the context of neuroscience, we apply this method to analyze the differences in the growth of the hippocampus in children diagnosed with autism, developmental delays and in controls. Results suggest that group differences may be better characterized by a different speed of maturation rather than by shape differences at a given age. In the context of anthropology, we assess the differences in the typical growth of the endocranium between chimpanzees and bonobos. We take advantage of this study to show the robustness of the method with respect to change of parameters and perturbation of the age estimates.

Journal ArticleDOI
TL;DR: An evolutionary selection algorithm that seeks global agreement among surface points while operating at a local level is adopted, allowing the method to attack a more challenging scenario where model and scene have different, unknown scales.
Abstract: During the last years a wide range of algorithms and devices have been made available to easily acquire range images. The increasing abundance of depth data boosts the need for reliable and unsupervised analysis techniques, spanning from part registration to automated segmentation. In this context, we focus on the recognition of known objects in cluttered and incomplete 3D scans. Locating and fitting a model to a scene are very important tasks in many scenarios such as industrial inspection, scene understanding, medical imaging and even gaming. For this reason, these problems have been addressed extensively in the literature. Several of the proposed methods adopt local descriptor-based approaches, while a number of hurdles still hinder the use of global techniques. In this paper we offer a different perspective on the topic: We adopt an evolutionary selection algorithm that seeks global agreement among surface points, while operating at a local level. The approach effectively extends the scope of local descriptors by actively selecting correspondences that satisfy global consistency constraints, allowing us to attack a more challenging scenario where model and scene have different, unknown scales. This leads to a novel and very effective pipeline for 3D object recognition, which is validated with an extensive set of experiments and comparisons with recent techniques at the state of the art.
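
A minimal sketch of an evolutionary (replicator-dynamics) selection over candidate correspondences, in the spirit of the abstract. The payoff matrix `A` (assumed nonnegative and symmetric, with zero diagonal) scores how well two correspondences agree globally, e.g. whether they preserve pairwise distances up to a common unknown scale.

```python
import numpy as np

def replicator_select(A, iters=200, tol=1e-9):
    """Evolve a population over candidate correspondences; mass concentrates
    on a mutually consistent subset."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)              # start from the barycentre of the simplex
    for _ in range(iters):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)        # replicator-dynamics update
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x                             # threshold to obtain the selected set
```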

Journal ArticleDOI
TL;DR: A novel approach to recovering and grouping the symmetric parts of an object from a cluttered scene by using a multiresolution superpixel segmentation to generate medial point hypotheses, and using a learned affinity function to perceptually group nearby medial points likely to belong to the same medial branch.
Abstract: Skeletonization algorithms typically decompose an object's silhouette into a set of symmetric parts, offering a powerful representation for shape categorization. However, having access to an object's silhouette assumes correct figure-ground segmentation, leading to a disconnect with the mainstream categorization community, which attempts to recognize objects from cluttered images. In this paper, we present a novel approach to recovering and grouping the symmetric parts of an object from a cluttered scene. We begin by using a multiresolution superpixel segmentation to generate medial point hypotheses, and use a learned affinity function to perceptually group nearby medial points likely to belong to the same medial branch. In the next stage, we learn higher granularity affinity functions to group the resulting medial branches likely to belong to the same object. The resulting framework yields a skeletal approximation that is free of many of the instabilities that occur with traditional skeletons. More importantly, it does not require a closed contour, enabling the application of skeleton-based categorization systems to more realistic imagery.

Journal ArticleDOI
TL;DR: It is demonstrated that the performance of specific object retrieval increases with the size of the vocabulary and that the large vocabularies increase the speed of the tf-idf scoring step.
Abstract: A novel similarity measure for bag-of-words type large scale image retrieval is presented. The similarity function is learned in an unsupervised manner, requires no extra space over the standard bag-of-words method and is more discriminative than both L2-based soft assignment and Hamming embedding. The novel similarity function achieves mean average precision that is superior to any result published in the literature on the standard Oxford 5k, Oxford 105k and Paris datasets/protocols. We study the effect of a fine quantization and very large vocabularies (up to 64 million words) and show that the performance of specific object retrieval increases with the size of the vocabulary. This observation is in contradiction with previously published results. We further demonstrate that the large vocabularies increase the speed of the tf-idf scoring step.
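
For context, a bare-bones inverted-file tf-idf scorer for bag-of-visual-words retrieval; with very large vocabularies each posting list is short, which is consistent with the faster scoring the paper reports. The index layout is illustrative, not the paper's implementation.

```python
import math
from collections import defaultdict

def build_index(db):                       # db: {image_id: list of visual word ids}
    postings = defaultdict(list)
    for img, words in db.items():
        tf = defaultdict(int)
        for w in words:
            tf[w] += 1
        for w, c in tf.items():
            postings[w].append((img, c / len(words)))
    idf = {w: math.log(len(db) / len(pl)) for w, pl in postings.items()}
    return postings, idf

def score(query_words, postings, idf):     # idf weights both query and database
    scores = defaultdict(float)
    for w in set(query_words):
        qtf = query_words.count(w) / len(query_words)
        for img, tf in postings.get(w, ()):
            scores[img] += qtf * tf * idf.get(w, 0.0) ** 2
    return sorted(scores.items(), key=lambda kv: -kv[1])
```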

Journal ArticleDOI
TL;DR: This work presents an alternating minimization algorithm to solve the resulting composite photometric/geometric inverse problem, derives the shape gradient of the data-fitting energy and investigates convex relaxation for the geometric problem.
Abstract: We introduce a new class of data-fitting energies that couple image segmentation with image restoration. These functionals model the image intensity using the statistical framework of generalized linear models. By duality, we establish an information-theoretic interpretation using Bregman divergences. We demonstrate how this formulation couples in a principled way image restoration tasks such as denoising, deblurring (deconvolution), and inpainting with segmentation. We present an alternating minimization algorithm to solve the resulting composite photometric/geometric inverse problem. We use Fisher scoring to solve the photometric problem and to provide asymptotic uncertainty estimates. We derive the shape gradient of our data-fitting energy and investigate convex relaxation for the geometric problem. We introduce a new alternating split-Bregman strategy to solve the resulting convex problem and present experiments and comparisons on both synthetic and real-world images.

Journal ArticleDOI
TL;DR: This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture that allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non- Rigid motion.
Abstract: This paper presents a general approach based on the shape similarity tree for non-sequential alignment across databases of multiple unstructured mesh sequences from non-rigid surface capture. The optimal shape similarity tree for non-rigid alignment is defined as the minimum spanning tree in shape similarity space. Non-sequential alignment based on the shape similarity tree minimises the total non-rigid deformation required to register all frames in a database into a consistent mesh structure with surfaces in correspondence. This allows alignment across multiple sequences of different motions, reduces drift in sequential alignment and is robust to rapid non-rigid motion. Evaluation is performed on three benchmark databases of 3D mesh sequences with a variety of complex human and cloth motion. Comparison with sequential alignment demonstrates reduced errors due to drift and improved robustness to large non-rigid deformation, together with global alignment across multiple sequences which is not possible with previous sequential approaches.
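
Once pairwise shape dissimilarities are available, the tree itself is cheap to compute; a minimal sketch (matrix layout assumed, with strictly positive off-diagonal dissimilarities):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def shape_similarity_tree(dissimilarity):
    """dissimilarity: (F, F) symmetric matrix over all frames; returns the
    MST edges along which non-rigid alignment is propagated from a root."""
    mst = minimum_spanning_tree(dissimilarity).toarray()
    return list(zip(*np.nonzero(mst)))
```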

Journal ArticleDOI
TL;DR: A novel framework for tracking thousands of vehicles in high resolution, low frame rate, multiple camera aerial videos; object state is represented in terms of many-to-many data associations per track, and multiple novel constraints are introduced to make the association problem tractable while allowing sharing of detections among tracks.
Abstract: This paper presents a novel framework for tracking thousands of vehicles in high resolution, low frame rate, multiple camera aerial videos. The proposed algorithm avoids the pitfalls of global minimization of data association costs and instead maintains multiple object-centric associations for each track. Representation of object state in terms of many-to-many data associations per track is proposed and multiple novel constraints are introduced to make the association problem tractable while allowing sharing of detections among tracks. Weighted hypothetical measurements are introduced to better handle occlusions, mis-detections and split or merged detections. A two-frame differencing method is presented which performs simultaneous moving object detection in both frames. Two novel contextual constraints of a vehicle following model, and discouragement of track intersection and merging, are also proposed. Extensive experiments on challenging, ground-truthed data sets are performed to show the feasibility and superiority of the proposed approach. Results of quantitative comparison with existing approaches are presented, and the efficacy of newly introduced constraints is experimentally established. The proposed algorithm performs better and faster than global, one-to-one data association methods.

Journal ArticleDOI
TL;DR: This paper proposes a kernel method for fast and robust PCA, called Euler-PCA, which utilizes a robust dissimilarity measure based on the Euler representation of complex numbers; Euler-PCA is shown to retain PCA’s desirable properties while suppressing outliers.
Abstract: Principal Component Analysis (PCA) is perhaps the most prominent learning tool for dimensionality reduction in pattern recognition and computer vision. However, the $\ell_2$-norm employed by standard PCA is not robust to outliers. In this paper, we propose a kernel PCA method for fast and robust PCA, which we call Euler-PCA (e-PCA). In particular, our algorithm utilizes a robust dissimilarity measure based on the Euler representation of complex numbers. We show that Euler-PCA retains PCA's desirable properties while suppressing outliers. Moreover, we formulate Euler-PCA in an incremental learning framework which allows for efficient computation. In our experiments we apply Euler-PCA to three different computer vision applications for which our method performs comparably with other state-of-the-art approaches.
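
A minimal sketch of the Euler representation followed by standard complex PCA; the parameter `alpha` and the normalisation are assumptions for illustration.

```python
import numpy as np

def euler_representation(X, alpha=0.5):
    """X: (n_samples, n_pixels) intensities in [0, 1], mapped onto the unit circle."""
    return np.exp(1j * alpha * np.pi * X) / np.sqrt(2)

def euler_pca(X, n_components, alpha=0.5):
    Z = euler_representation(X, alpha)
    Zc = Z - Z.mean(axis=0)
    # complex PCA via SVD; rows of Vh are principal directions
    _, _, Vh = np.linalg.svd(Zc, full_matrices=False)
    return Vh[:n_components].conj().T    # (n_pixels, n_components) basis
```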

Journal ArticleDOI
TL;DR: A novel, fully automatic 2D/3D global registration pipeline consisting of several stages that simultaneously register the input image set on the corresponding 3D object; the pipeline is general, robust, and capable of dealing with small and big objects of any shape.
Abstract: The photorealistic acquisition of 3D objects often requires color information from digital photography to be mapped on the acquired geometry, in order to obtain a textured 3D model. This paper presents a novel fully automatic 2D/3D global registration pipeline consisting of several stages that simultaneously register the input image set on the corresponding 3D object. The first stage exploits Structure From Motion (SFM) on the image set in order to generate a sparse point cloud. During the second stage, this point cloud is aligned to the 3D object using an extension of the 4 Point Congruent Set (4PCS) algorithm for the alignment of range maps. The extension accounts for models with different scales and unknown regions of overlap. In the last processing stage a global refinement algorithm based on mutual information optimizes the color projection of the aligned photos on the 3D object, in order to obtain high quality textures. The proposed registration pipeline is general, capable of dealing with small and big objects of any shape, and robust. We present results from six real cases, evaluating the quality of the final colors mapped onto the 3D object. A comparison with a ground truth dataset is also presented.