
Showing papers in "International Journal of Computer Vision in 2010"


Journal ArticleDOI
TL;DR: The state of the art in evaluated methods for both classification and detection is reviewed, analysing whether the methods are statistically different, what they learn from the images, and what they find easy or confuse.
Abstract: The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.

15,935 citations
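
The detection task ranks detections by confidence and scores them with average precision. As a concrete illustration, a minimal sketch of the 11-point interpolated AP used in early editions of the challenge; the official development kit is the reference implementation, and the inputs here are illustrative:

    import numpy as np

    def interpolated_ap(scores, is_true_positive, n_gt):
        """11-point interpolated average precision: mean over recall
        thresholds 0.0, 0.1, ..., 1.0 of the best precision achieved
        at recall >= threshold. `n_gt` is the number of ground-truth
        objects; detections are ranked by confidence."""
        order = np.argsort(-np.asarray(scores))
        tp = np.asarray(is_true_positive, dtype=float)[order]
        recall = np.cumsum(tp) / n_gt
        precision = np.cumsum(tp) / (np.arange(len(tp)) + 1)
        ap = 0.0
        for t in np.linspace(0.0, 1.0, 11):
            above = precision[recall >= t]
            ap += (above.max() if above.size else 0.0) / 11.0
        return ap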


Journal ArticleDOI
TL;DR: A baseline algorithm for 3D articulated tracking that uses a relatively standard Bayesian framework with optimization in the form of Sequential Importance Resampling and Annealed Particle Filtering is described, and a variety of likelihood functions, prior models of human motion and the effects of algorithm parameters are explored.
Abstract: While research on articulated human motion and pose estimation has progressed rapidly in the last few years, there has been no systematic quantitative evaluation of competing methods to establish the current state of the art. We present data obtained using a hardware system that is able to capture synchronized video and ground-truth 3D motion. The resulting HumanEva datasets contain multiple subjects performing a set of predefined actions with a number of repetitions. On the order of 40,000 frames of synchronized motion capture and multi-view video (resulting in over one quarter million image frames in total) were collected at 60 Hz with an additional 37,000 time instants of pure motion capture data. A standard set of error measures is defined for evaluating both 2D and 3D pose estimation and tracking algorithms. We also describe a baseline algorithm for 3D articulated tracking that uses a relatively standard Bayesian framework with optimization in the form of Sequential Importance Resampling and Annealed Particle Filtering. In the context of this baseline algorithm we explore a variety of likelihood functions, prior models of human motion and the effects of algorithm parameters. Our experiments suggest that image observation models and motion priors play important roles in performance, and that in a multi-view laboratory environment, where initialization is available, Bayesian filtering tends to perform well. The datasets and the software are made available to the research community. This infrastructure will support the development of new articulated motion and pose estimation algorithms, will provide a baseline for the evaluation and comparison of new methods, and will help establish the current state of the art in human pose estimation and tracking.

1,130 citations
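
As an example of the kind of error measure the benchmark standardizes, a common form is the average Euclidean distance between estimated and ground-truth 3D marker positions; a minimal numpy sketch (the paper defines the exact 2D and 3D measures):

    import numpy as np

    def mean_marker_error(pred, gt):
        """Average Euclidean distance between predicted and ground-truth
        3D marker positions. `pred` and `gt` have shape
        (n_frames, n_markers, 3); returns per-frame errors and their mean."""
        per_frame = np.linalg.norm(pred - gt, axis=-1).mean(axis=1)
        return per_frame, per_frame.mean()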


Journal ArticleDOI
TL;DR: A more precise representation based on Hamming embedding (HE) and weak geometric consistency constraints (WGC) is derived and this approach is shown to outperform the state-of-the-art on the three datasets.
Abstract: This article improves recent methods for large scale image search. We first analyze the bag-of-features approach in the framework of approximate nearest neighbor search. This leads us to derive a more precise representation based on Hamming embedding (HE) and weak geometric consistency constraints (WGC). HE provides binary signatures that refine the matching based on visual words. WGC filters matching descriptors that are not consistent in terms of angle and scale. HE and WGC are integrated within an inverted file and are efficiently exploited for all images in the dataset. We then introduce a graph-structured quantizer which significantly speeds up the assignment of the descriptors to visual words. A comparison with the state of the art shows the benefit of our approach when high accuracy is needed. Experiments performed on three reference datasets and a dataset of one million images show a significant improvement due to the binary signature and the weak geometric consistency constraints, as well as their efficiency. Estimation of the full geometric transformation, i.e., a re-ranking step on a short-list of images, is shown to be complementary to our weak geometric consistency constraints. Our approach is shown to outperform the state of the art on the three datasets.

795 citations
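
The core filtering idea is easy to state: a database descriptor counts as a match only if it is assigned to the same visual word as the query descriptor and its binary signature lies within a Hamming threshold. A toy sketch under those assumptions (signatures as 0/1 numpy arrays; the threshold `ht` is illustrative):

    import numpy as np

    def he_matches(q_word, q_sig, db_words, db_sigs, ht=24):
        """Return indices of database descriptors matching the query:
        same visual word AND Hamming distance between binary
        signatures at most `ht`."""
        hits = []
        for j, (w, s) in enumerate(zip(db_words, db_sigs)):
            if w == q_word and np.count_nonzero(s != q_sig) <= ht:
                hits.append(j)
        return hits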


Journal ArticleDOI
TL;DR: An algorithm for the detection of highly repeatable keypoints on 3D models and partial views of objects and an automatic scale selection technique for extracting multi-scale and scale invariant features to match objects at different unknown scales are presented.
Abstract: 3D object recognition from local features is robust to occlusions and clutter. However, local features must be extracted from a small set of feature-rich keypoints to avoid computational complexity and ambiguous features. We present an algorithm for the detection of such keypoints on 3D models and partial views of objects. The keypoints are highly repeatable between partial views of an object and its complete 3D model. We also propose a quality measure to rank the keypoints and select the best ones for extracting local features. Keypoints are identified at locations where a unique local 3D coordinate basis can be derived from the underlying surface in order to extract invariant features. We also propose an automatic scale selection technique for extracting multi-scale and scale-invariant features to match objects at different unknown scales. Features are projected to a PCA subspace and matched to find correspondences between a database and query object. Each pair of matching features gives a transformation that aligns the query and database object. These transformations are clustered and the biggest cluster is used to identify the query object. Experiments on a public database revealed that the proposed quality measure relates correctly to the repeatability of keypoints and the multi-scale features have a recognition rate of over 95% for up to 80% occluded objects.

432 citations
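
The verification step described at the end — each feature match votes with an aligning transformation, and the biggest cluster of transformations identifies the object — can be sketched as follows. Clustering only the translation components with DBSCAN is a simplification of the paper's clustering, and `eps` is an illustrative tolerance:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def consensus_matches(translations, eps=5.0):
        """Cluster the translation parts of the per-match rigid
        transformations (shape (n, 3)) and return the indices of the
        largest cluster, taken as the consensus alignment."""
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(translations)
        valid = labels[labels >= 0]
        if valid.size == 0:
            return np.array([], dtype=int)
        biggest = np.bincount(valid).argmax()
        return np.flatnonzero(labels == biggest)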


Journal ArticleDOI
TL;DR: This work reviews the evolution of nonparametric regression modeling in imaging, from the local Nadaraya-Watson kernel estimate to the nonlocal means and further to transform-domain filtering based on nonlocal block-matching.
Abstract: We review the evolution of nonparametric regression modeling in imaging, from the local Nadaraya-Watson kernel estimate to the nonlocal means and further to transform-domain filtering based on nonlocal block-matching. The methods considered are classified according to two main features: local/nonlocal and pointwise/multipoint. Here nonlocal is an alternative to local, and multipoint is an alternative to pointwise. These alternatives, though obvious simplifications, allow us to impose a fruitful and transparent classification of the basic ideas in the advanced techniques. Within this framework, we introduce a novel single- and multiple-model transform-domain nonlocal approach. The Block Matching and 3-D Filtering (BM3D) algorithm, which is currently one of the best performing denoising algorithms, is treated as a special case of the latter approach.

382 citations
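
Of the reviewed families, the nonlocal pointwise estimate is the easiest to write down: each pixel is replaced by a weighted average over a search window, with weights given by patch similarity. A direct (unoptimized) sketch of the classical nonlocal means estimate for a single pixel:

    import numpy as np

    def nlmeans_pixel(img, y, x, patch=3, search=10, h=0.1):
        """Nonlocal means estimate at (y, x): average pixels in a search
        window, weighted by similarity of their surrounding patches to
        the patch around (y, x); `h` controls the weight decay."""
        p = patch // 2
        ref = img[y - p:y + p + 1, x - p:x + p + 1]
        num = den = 0.0
        for yy in range(max(p, y - search), min(img.shape[0] - p, y + search + 1)):
            for xx in range(max(p, x - search), min(img.shape[1] - p, x + search + 1)):
                cand = img[yy - p:yy + p + 1, xx - p:xx + p + 1]
                w = np.exp(-np.sum((ref - cand) ** 2) / (h * h))
                num += w * img[yy, xx]
                den += w
        return num / den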


Journal ArticleDOI
TL;DR: Frequency analysis allows for greater accuracy in the removal of dynamic weather and in the performance of feature extraction than previous pixel-based or patch-based methods and is effective for videos with both scene and camera motions.
Abstract: Dynamic weather such as rain and snow causes complex spatio-temporal intensity fluctuations in videos. Such fluctuations can adversely impact vision systems that rely on small image features for tracking, object detection and recognition. While these effects appear to be chaotic in space and time, we show that dynamic weather has a predictable global effect in frequency space. For this, we first develop a model of the shape and appearance of a single rain or snow streak in image space. Detecting individual streaks is difficult even with an accurate appearance model, so we combine the streak model with the statistical characteristics of rain and snow to create a model of the overall effect of dynamic weather in frequency space. Our model is then fit to a video and is used to detect rain or snow streaks first in frequency space, and the detection result is then transferred to image space. Once detected, the amount of rain or snow can be reduced or increased. We demonstrate that our frequency analysis allows for greater accuracy in the removal of dynamic weather and in the performance of feature extraction than previous pixel-based or patch-based methods. We also show that unlike previous techniques, our approach is effective for videos with both scene and camera motions.

357 citations
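
A generic illustration of the domain change the method relies on — moving each pixel's temporal profile into frequency space with an FFT, the representation in which the authors fit their streak model (the model itself and the detection rule are beyond this sketch):

    import numpy as np

    def temporal_spectra(video):
        """Per-pixel temporal frequency content of a video given as an
        array (n_frames, H, W): FFT magnitudes along the time axis,
        the domain in which the global weather model is fit."""
        return np.abs(np.fft.rfft(video, axis=0))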


Journal ArticleDOI
TL;DR: An object class detection approach is presented which fully integrates the complementary strengths offered by shape matchers; it can localize object boundaries accurately and does not need segmented examples for training (only bounding-boxes).
Abstract: We present an object class detection approach which fully integrates the complementary strengths offered by shape matchers. Like an object detector, it can learn class models directly from images, and can localize novel instances in the presence of intra-class variations, clutter, and scale changes. Like a shape matcher, it finds the boundaries of objects, rather than just their bounding-boxes. This is achieved by a novel technique for learning a shape model of an object class given images of example instances. Furthermore, we also integrate Hough-style voting with a non-rigid point matching algorithm to localize the model in cluttered images. As demonstrated by an extensive evaluation, our method can localize object boundaries accurately and does not need segmented examples for training (only bounding-boxes).

339 citations
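
The Hough-style localization stage can be sketched generically: each matched local feature votes for an object center, and peaks in the vote map give candidate detections. This is a simplification that leaves out the paper's non-rigid point matching refinement:

    import numpy as np

    def hough_votes(matches, shape):
        """Accumulate center votes: `matches` holds (y, x, dy, dx, w)
        tuples, a feature at (y, x) voting with weight w for an object
        center displaced by (dy, dx). Returns the vote map."""
        acc = np.zeros(shape)
        for y, x, dy, dx, w in matches:
            cy, cx = int(y + dy), int(x + dx)
            if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                acc[cy, cx] += w
        return acc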


Journal ArticleDOI
TL;DR: This paper presents a novel object recognition algorithm that performs automatic dataset collection and incremental model learning simultaneously, adapting a non-parametric latent topic model within a proposed incremental learning framework.
Abstract: The explosion of the Internet provides us with a tremendous resource of images shared online. It also confronts vision researchers with the problem of finding effective methods to navigate the vast amount of visual information. Semantic image understanding plays a vital role towards solving this problem. One important task in image understanding is object recognition, in particular, generic object categorization. Critical to this problem are the issues of learning and datasets. Abundant data helps to train a robust recognition system, while a good object classifier can help to collect a large amount of images. This paper presents a novel object recognition algorithm that performs automatic dataset collection and incremental model learning simultaneously. The goal of this work is to use the tremendous resources of the web to learn robust object category models for detecting and searching for objects in real-world cluttered scenes. Humans continuously update their knowledge of objects when new examples are observed. Our framework emulates this human learning process by iteratively accumulating model knowledge and image examples. We adapt a non-parametric latent topic model and propose an incremental learning framework. Our algorithm is capable of automatically collecting much larger object category datasets for 22 randomly selected classes from the Caltech 101 dataset. Furthermore, our system offers not only more images in each object category but also a robust object category model and meaningful image annotation. Our experiments show that OPTIMOL is capable of collecting image datasets that are superior to the well known manually collected object datasets Caltech 101 and LabelMe.

338 citations
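
The iterative accumulate-and-learn loop can be caricatured in a few lines. The sketch below substitutes a logistic classifier for the paper's non-parametric latent topic model, purely to show the control flow; the confidence threshold and classifier choice are assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def collect_and_learn(seed_X, seed_y, pool_X, rounds=5, tau=0.9):
        """Train on a labeled seed set (positives and background),
        accept unlabeled pool images classified as positive with
        confidence >= tau, add them as new positives, and retrain.
        All inputs are numpy arrays."""
        X, y, pool = seed_X, seed_y, pool_X
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        for _ in range(rounds):
            if len(pool) == 0:
                break
            keep = clf.predict_proba(pool)[:, 1] >= tau
            X = np.vstack([X, pool[keep]])
            y = np.concatenate([y, np.ones(int(keep.sum()), dtype=int)])
            pool = pool[~keep]
            clf = LogisticRegression(max_iter=1000).fit(X, y)
        return clf, X, y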


Journal ArticleDOI
TL;DR: This paper explores the applicability of diffusion distances within the Gromov-Hausdorff framework and finds that in addition to the relatively low complexity involved in the computation of the diffusion distances between surface points, its recognition and matching performances favorably compare to the classical geodesic distances in the presence of topological changes between the non-rigid shapes.
Abstract: In this paper, the problem of non-rigid shape recognition is studied from the perspective of metric geometry. In particular, we explore the applicability of diffusion distances within the Gromov-Hausdorff framework. While the traditionally used geodesic distance exploits the shortest path between points on the surface, the diffusion distance averages all paths connecting the points. The diffusion distance constitutes an intrinsic metric which is robust, in particular, to topological changes. Such changes in the form of shortcuts, holes, and missing data may be a result of natural non-rigid deformations as well as acquisition and representation noise due to inaccurate surface construction. The presentation of the proposed framework is complemented with examples demonstrating that in addition to the relatively low complexity involved in the computation of the diffusion distances between surface points, its recognition and matching performances favorably compare to the classical geodesic distances in the presence of topological changes between the non-rigid shapes.

306 citations
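
Given the eigendecomposition of the shape's Laplace-Beltrami operator, the diffusion distance has a closed form, which also makes the claimed low computational complexity apparent. A minimal sketch with precomputed eigenpairs:

    import numpy as np

    def diffusion_distance(evals, evecs, i, j, t=1.0):
        """Diffusion distance at time t between vertices i and j:
        d_t(i, j)^2 = sum_k exp(-2 t lambda_k) (phi_k(i) - phi_k(j))^2,
        with eigenvalues `evals` and eigenfunctions in the columns of
        `evecs` (rows index vertices)."""
        diff = evecs[i] - evecs[j]
        return np.sqrt(np.sum(np.exp(-2.0 * t * evals) * diff ** 2))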


Journal ArticleDOI
TL;DR: Twin Gaussian processes (TGP), a generic structured prediction method that uses Gaussian process priors on both covariates and responses, both multivariate, and estimates outputs by minimizing the Kullback-Leibler divergence between two GPs modeled as normal distributions over finite index sets of training and testing examples, is described.
Abstract: We describe twin Gaussian processes (TGP), a generic structured prediction method that uses Gaussian process (GP) priors on both covariates and responses, both multivariate, and estimates outputs by minimizing the Kullback-Leibler divergence between two GPs modeled as normal distributions over finite index sets of training and testing examples, emphasizing the goal that similar inputs should produce similar percepts and this should hold, on average, between their marginal distributions. TGP captures not only the interdependencies between covariates, as in a typical GP, but also those between responses, so correlations among both inputs and outputs are accounted for. TGP is exemplified, with promising results, for the reconstruction of 3d human poses from monocular and multicamera video sequences in the recently introduced HumanEva benchmark, where we achieve 5 cm error on average per 3d marker for models trained jointly, using data from multiple people and multiple activities. The method is fast and automatic: it requires no hand-crafting of the initial pose, camera calibration parameters, or the availability of a 3d body model associated with human subjects used for training or testing.

303 citations
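
The central quantity — the Kullback-Leibler divergence between two GPs viewed as normal distributions over a finite index set — has a standard closed form. A sketch of that building block (the full TGP estimator minimizes such a divergence over candidate outputs, which is not shown here):

    import numpy as np

    def gaussian_kl(mu0, S0, mu1, S1):
        """KL(N(mu0, S0) || N(mu1, S1)) between two multivariate
        normals of dimension k."""
        k = len(mu0)
        S1_inv = np.linalg.inv(S1)
        d = mu1 - mu0
        return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d - k
                      + np.log(np.linalg.det(S1) / np.linalg.det(S0)))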


Journal ArticleDOI
TL;DR: A multi-layer framework that combines stochastic optimization, filtering, and local optimization is introduced and quantitative 3D pose tracking results for the complete HumanEva-II dataset are provided.
Abstract: Local optimization and filtering have been widely applied to model-based 3D human motion capture. Global stochastic optimization has recently been proposed as promising alternative solution for tracking and initialization. In order to benefit from optimization and filtering, we introduce a multi-layer framework that combines stochastic optimization, filtering, and local optimization. While the first layer relies on interacting simulated annealing and some weak prior information on physical constraints, the second layer refines the estimates by filtering and local optimization such that the accuracy is increased and ambiguities are resolved over time without imposing restrictions on the dynamics. In our experimental evaluation, we demonstrate the significant improvements of the multi-layer framework and provide quantitative 3D pose tracking results for the complete HumanEva-II dataset. The paper further comprises a comparison of global stochastic optimization with particle filtering, annealed particle filtering, and local optimization.

Journal ArticleDOI
TL;DR: This paper extends Nadaraya-Watson kernel regression by recasting the regression problem in terms of Fréchet expectation, and uses the infinite dimensional manifold of diffeomorphic transformations, with an associated metric, to study the small scale changes in anatomy.
Abstract: Regression analysis is a powerful tool for the study of changes in a dependent variable as a function of an independent regressor variable, and in particular it is applicable to the study of anatomical growth and shape change. When the underlying process can be modeled by parameters in a Euclidean space, classical regression techniques (Hardle, Applied Nonparametric Regression, 1990; Wand and Jones, Kernel Smoothing, 1995) are applicable and have been studied extensively. However, recent work suggests that attempts to describe anatomical shapes using flat Euclidean spaces undermine our ability to represent natural biological variability (Fletcher et al., IEEE Trans. Med. Imaging 23(8), 995–1005, 2004; Grenander and Miller, Q. Appl. Math. 56(4), 617–694, 1998). In this paper we develop a method for regression analysis of general, manifold-valued data. Specifically, we extend Nadaraya-Watson kernel regression by recasting the regression problem in terms of Fréchet expectation. Although this method is quite general, our driving problem is the study of anatomical shape change as a function of age from random design image data. We demonstrate our method by analyzing shape change in the brain from a random design dataset of MR images of 97 healthy adults ranging in age from 20 to 79 years. To study the small scale changes in anatomy, we use the infinite dimensional manifold of diffeomorphic transformations, with an associated metric. We regress a representative anatomical shape, as a function of age, from this population.
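
A sketch of the manifold Nadaraya-Watson idea: the regressed shape at a query age is the kernel-weighted Fréchet mean of the data, computed here by iterating caller-supplied exponential and logarithm maps. The signatures log(base, p) -> tangent vector and exp(base, v) -> point are hypothetical, and the bandwidth and iteration count are illustrative:

    import numpy as np

    def manifold_kernel_regression(t_query, ts, points, log, exp, h=1.0, iters=20):
        """Estimate the weighted Frechet mean of `points` with Gaussian
        kernel weights in the regressor variable t (e.g. age)."""
        w = np.exp(-0.5 * ((np.asarray(ts) - t_query) / h) ** 2)
        w /= w.sum()
        mu = points[int(np.argmax(w))]          # start at the best-weighted point
        for _ in range(iters):                  # fixed-point iteration for the mean
            v = sum(wi * log(mu, p) for wi, p in zip(w, points))
            mu = exp(mu, v)
        return mu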

Journal ArticleDOI
TL;DR: It is analytically shown that the proposed gamut mapping framework is able to include any linear filter output, and that derivatives have the advantage over pixel values of being invariant to disturbing effects (i.e. deviations from the diagonal model) such as saturated colors and diffuse light.
Abstract: The gamut mapping algorithm is one of the most promising methods to achieve computational color constancy. However, so far, gamut mapping algorithms are restricted to the use of pixel values to estimate the illuminant. Therefore, in this paper, gamut mapping is extended to incorporate the statistical nature of images. It is analytically shown that the proposed gamut mapping framework is able to include any linear filter output. The main focus is on the local n-jet describing the derivative structure of an image. It is shown that derivatives have the advantage over pixel values of being invariant to disturbing effects (i.e. deviations from the diagonal model) such as saturated colors and diffuse light. Further, as the n-jet based gamut mapping has the ability to use more information than pixel values alone, the combination of these algorithms is more stable than the regular gamut mapping algorithm. Different methods of combining are proposed. Based on theoretical and experimental results conducted on large scale data sets of hyperspectral, laboratory and real-world scenes, it can be derived that (1) in case of deviations from the diagonal model, the derivative-based approach outperforms the pixel-based gamut mapping, (2) state-of-the-art algorithms are outperformed by the n-jet based gamut mapping, (3) the combination of the different n-jet based gamut mappings provides more stable solutions, and (4) the fusion strategy based on the intersection of feasible sets provides better color constancy results than the union of the feasible sets.
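
The derivative structure the method feeds into gamut mapping is the local n-jet, i.e. Gaussian derivative filter outputs per color channel. A first-order sketch with scipy; the full method goes much further and intersects feasible illuminant sets:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def first_order_jet(img, sigma=1.0):
        """Gaussian-smoothed x/y derivatives of each color channel of
        `img` (H x W x 3): the kind of linear filter output the
        extended gamut mapping framework can use instead of pixels."""
        channels = []
        for c in range(img.shape[2]):
            channels.append(gaussian_filter(img[..., c], sigma, order=(0, 1)))
            channels.append(gaussian_filter(img[..., c], sigma, order=(1, 0)))
        return np.stack(channels, axis=-1)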

Journal ArticleDOI
TL;DR: A novel method, ICF (Identifying point correspondences by Correspondence Function), is proposed for rejecting mismatches from given putative point correspondences; it is applicable to images of rigid objects or images of non-rigid objects with unknown deformation.
Abstract: A novel method ICF (Identifying point correspondences by Correspondence Function) is proposed for rejecting mismatches from given putative point correspondences. By analyzing the connotation of homography, we introduce a novel concept of correspondence function for two images of a general 3D scene, which captures the relationships between corresponding points by mapping a point in one image to its corresponding point in another. Since the correspondence functions are unknown in real applications, we also study how to estimate them from given putative correspondences, and propose an algorithm IECF (Iteratively Estimate Correspondence Function) based on a diagnostic technique and SVM. Then, the proposed ICF method is able to reject the mismatches by checking whether they are consistent with the estimated correspondence functions. Extensive experiments on real images demonstrate the excellent performance of our proposed method. In addition, the ICF is a general method for rejecting mismatches, and it is applicable to images of rigid objects or images of non-rigid objects with unknown deformation.
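
In the spirit of the method, one can regress each coordinate of the corresponding point as a smooth function of the source point and reject putative matches with large residuals. The sketch below uses support vector regression as the function estimator but omits the iterative diagnostic reweighting of IECF; the residual threshold is illustrative:

    import numpy as np
    from sklearn.svm import SVR

    def reject_mismatches(pts1, pts2, thresh=5.0):
        """Fit correspondence functions x2 = f(p1), y2 = g(p1) on the
        putative matches (pts1, pts2 of shape (n, 2)) and keep those
        whose prediction residual is at most `thresh` pixels."""
        fx = SVR(kernel='rbf').fit(pts1, pts2[:, 0])
        fy = SVR(kernel='rbf').fit(pts1, pts2[:, 1])
        pred = np.stack([fx.predict(pts1), fy.predict(pts1)], axis=1)
        return np.linalg.norm(pred - pts2, axis=1) <= thresh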

Journal ArticleDOI
TL;DR: The goal of this paper is to discover the objects present in the images by analyzing unlabeled data and searching for re-occurring patterns, and a rigorous framework for evaluating unsupervised object discovery methods is proposed.
Abstract: The goal of this paper is to evaluate and compare models and methods for learning to recognize basic entities in images in an unsupervised setting. In other words, we want to discover the objects present in the images by analyzing unlabeled data and searching for re-occurring patterns. We experiment with various baseline methods, methods based on latent variable models, as well as spectral clustering methods. The results are presented and compared both on subsets of Caltech256 and MSRC2, data sets that are larger and more challenging and that include more object classes than what has previously been reported in the literature. A rigorous framework for evaluating unsupervised object discovery methods is proposed.

Journal ArticleDOI
TL;DR: This paper investigates the performance of an approach which represents textures as histograms over a visual vocabulary which is defined geometrically, based on the Basic Image Features of Griffin and Lillholm, rather than by clustering.
Abstract: Representing texture images statistically as histograms over a discrete vocabulary of local features has proven widely effective for texture classification tasks. Images are described locally by vectors of, for example, responses to some filter bank; and a visual vocabulary is defined as a partition of this descriptor-response space, typically based on clustering. In this paper, we investigate the performance of an approach which represents textures as histograms over a visual vocabulary which is defined geometrically, based on the Basic Image Features of Griffin and Lillholm (Proc. SPIE 6492(09):1–11, 2007), rather than by clustering. BIFs provide a natural mathematical quantisation of a filter-response space into qualitatively distinct types of local image structure. We also extend our approach to deal with intra-class variations in scale. Our algorithm is simple: there is no need for a pre-training step to learn a visual dictionary, as in methods based on clustering, and no tuning of parameters is required to deal with different datasets. We have tested our implementation on three popular and challenging texture datasets and find that it produces consistently good classification results on each, including what we believe to be the best reported for the KTH-TIPS and equal best reported for the UIUCTex databases.
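
The descriptor pipeline — label every pixel with a qualitative local-structure type from the Gaussian jet, then histogram the labels — needs no learned dictionary. The sketch below uses a simplified four-type quantization as a stand-in for the seven Basic Image Features; it is not Griffin and Lillholm's exact classification rule:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def structure_histogram(img, sigma=2.0, eps=1e-2):
        """Histogram of qualitative 2-jet structure labels:
        0 flat, 1 gradient-dominated, 2 dark blob, 3 bright blob."""
        gx = gaussian_filter(img, sigma, order=(0, 1))
        gy = gaussian_filter(img, sigma, order=(1, 0))
        lap = sigma ** 2 * (gaussian_filter(img, sigma, order=(0, 2))
                            + gaussian_filter(img, sigma, order=(2, 0)))
        grad = sigma * np.hypot(gx, gy)
        labels = np.zeros(img.shape, dtype=int)
        labels[grad > eps] = 1
        blob = np.abs(lap) > np.maximum(grad, eps)
        labels[blob & (lap > 0)] = 2
        labels[blob & (lap < 0)] = 3
        hist = np.bincount(labels.ravel(), minlength=4).astype(float)
        return hist / hist.sum()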

Journal ArticleDOI
TL;DR: An approach for accurately measuring human motion through Markerless Motion Capture (MMC) is presented that uses multiple color cameras and combines an accurate and anatomically consistent tracking algorithm with a method for automatically generating subject specific models.
Abstract: An approach for accurately measuring human motion through Markerless Motion Capture (MMC) is presented. The method uses multiple color cameras and combines an accurate and anatomically consistent tracking algorithm with a method for automatically generating subject-specific models. The tracking approach employed a Levenberg-Marquardt minimization scheme over an iterative closest point algorithm with six degrees of freedom for each body segment. Anatomical consistency was maintained by enforcing rotational and translational joint range of motion constraints for each specific joint. A subject-specific model was obtained through an automatic model generation algorithm (Corazza et al. in IEEE Trans. Biomed. Eng., 2009) which combines a space of human shapes (Anguelov et al. in Proceedings SIGGRAPH, 2005) with biomechanically consistent kinematic models and a pose-shape matching algorithm. There were 15 anatomical body segments and 14 joints, each with six degrees of freedom (13 and 12, respectively, for the HumanEva II dataset). The overall method is an improvement over (Mundermann et al. in Proceedings of CVPR, 2007) in terms of both accuracy and robustness. Since the method was originally developed for ≥8 cameras, its performance was tested both (i) on the HumanEva II dataset (Sigal and Black, Technical Report CS-06-08, 2006) in a 4-camera configuration, and (ii) on a series of motions including walking trials, a very challenging gymnastic motion, and a dataset with motions similar to HumanEva II but with a variable number of cameras.
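
One building block of the tracking — rigidly aligning a body-segment model to its closest measured points — has a classical closed form (Kabsch/Procrustes). A sketch of that single step, leaving out the articulated joint constraints and the Levenberg-Marquardt outer loop:

    import numpy as np

    def rigid_align(P, Q):
        """Best rotation R and translation t mapping points P onto Q
        (both (n, 3)) in the least-squares sense: R @ p + t ~= q."""
        cp, cq = P.mean(axis=0), Q.mean(axis=0)
        H = (P - cp).T @ (Q - cq)
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        return R, cq - R @ cp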

Journal ArticleDOI
TL;DR: This work introduces a new and simple baseline technique for image annotation that treats annotation as a retrieval problem and outperforms the current state-of-the-art methods on two standard and one large Web dataset.
Abstract: Automatically assigning keywords to images is of great interest as it allows one to retrieve, index, organize and understand large collections of image data. Many techniques have been proposed for image annotation in the last decade that give reasonable performance on standard datasets. However, most of these works fail to compare their methods with simple baseline techniques to justify the need for complex models and subsequent training. In this work, we introduce a new and simple baseline technique for image annotation that treats annotation as a retrieval problem. The proposed technique utilizes global low-level image features and a simple combination of basic distance measures to find nearest neighbors of a given image. The keywords are then assigned using a greedy label transfer mechanism. The proposed baseline method outperforms the current state-of-the-art methods on two standard and one large Web dataset. We believe that such a baseline measure will provide a strong platform to compare and better understand future annotation techniques.
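
The baseline is essentially nearest-neighbor retrieval plus greedy label transfer. A compact sketch under simple assumptions — one L1 distance per global feature type, max-normalized before averaging; the paper's exact distance combination and transfer ordering differ in detail:

    import numpy as np

    def annotate(query_feats, db_feats, db_tags, k=5, n_tags=5):
        """`query_feats`/`db_feats`: one entry per feature type (query
        vector / database matrix). Combine per-feature L1 distances,
        take the k nearest training images, and transfer their most
        frequent keywords."""
        dist = np.zeros(len(db_tags))
        for qf, dbf in zip(query_feats, db_feats):
            d = np.abs(dbf - qf).sum(axis=1)
            dist += d / (d.max() + 1e-12)
        votes = {}
        for n in np.argsort(dist)[:k]:
            for tag in db_tags[n]:
                votes[tag] = votes.get(tag, 0) + 1
        return sorted(votes, key=votes.get, reverse=True)[:n_tags]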

Journal ArticleDOI
TL;DR: A novel 3D shape descriptor is presented that uses a set of panoramic views of a 3D object to describe the position and orientation of the object’s surface in 3D space; retrieval performance is further increased by employing a local (unsupervised) relevance feedback technique that shifts the descriptor of an object closer to its cluster centroid in feature space.
Abstract: We present a novel 3D shape descriptor that uses a set of panoramic views of a 3D object which describe the position and orientation of the object's surface in 3D space. We obtain a panoramic view of a 3D object by projecting it to the lateral surface of a cylinder parallel to one of its three principal axes and centered at the centroid of the object. The object is projected to three perpendicular cylinders, each one aligned with one of its principal axes in order to capture the global shape of the object. For each projection we compute the corresponding 2D Discrete Fourier Transform as well as 2D Discrete Wavelet Transform. We further increase the retrieval performance by employing a local (unsupervised) relevance feedback technique that shifts the descriptor of an object closer to its cluster centroid in feature space. The effectiveness of the proposed 3D object retrieval methodology is demonstrated via an extensive consistent evaluation in standard benchmarks that clearly shows better performance against state-of-the-art 3D object retrieval methods.
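
Once a cylinder unfolding of the surface is available as a 2D array, the per-view descriptor is a transform-coefficient signature. A sketch of the Fourier half; the paper concatenates three axis-aligned cylinders and also uses 2D wavelet coefficients, and the coefficient count `n` is illustrative:

    import numpy as np

    def panorama_fourier_descriptor(pano, n=8):
        """Magnitudes of the lowest n x n 2D DFT coefficients of one
        panoramic view `pano` (a 2D array of surface position or
        orientation values on the unfolded cylinder)."""
        F = np.fft.fft2(pano)
        return np.abs(F[:n, :n]).ravel()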

Journal ArticleDOI
TL;DR: This work shows that with an appropriate combination of kernels a significant boost in classification performance is possible, and indicates the utility of active learning with probabilistic predictive models, especially when the amount of training data labels that may be sought for a category is ultimately very small.
Abstract: Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) provide a framework for deriving regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions defined based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. Our probabilistic formulation provides a principled way to learn hyperparameters, which we utilize to learn an optimal combination of multiple covariance functions. It also offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We show that with an appropriate combination of kernels a significant boost in classification performance is possible. Further, our experiments indicate the utility of active learning with probabilistic predictive models, especially when the amount of training data labels that may be sought for a category is ultimately very small.
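
Because the GP view yields a predictive variance at every unlabeled point, active learning reduces to querying the most uncertain example. A sketch of standard GP regression with an arbitrary precomputed covariance (e.g. a pyramid match kernel) and the resulting selection rule; the noise level is an assumption:

    import numpy as np

    def gp_predict_and_select(K_train, K_cross, k_diag, y, noise=1e-3):
        """GP regression with a precomputed kernel: K_train is k(X, X),
        K_cross is k(X, X_pool), k_diag holds k(x, x) for each pool
        point. Returns predictive means/variances on the pool and the
        index of the most uncertain point to label next."""
        L = np.linalg.cholesky(K_train + noise * np.eye(len(y)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        mean = K_cross.T @ alpha
        v = np.linalg.solve(L, K_cross)
        var = k_diag - np.sum(v ** 2, axis=0)
        return mean, var, int(np.argmax(var))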

Journal ArticleDOI
TL;DR: The proposed method supports multimodal queries (2D images, sketches, 3D objects) by introducing a novel view-based approach able to handle the different types of multimedia data.
Abstract: This paper presents a unified framework for 3D shape retrieval. The method supports multimodal queries (2D images, sketches, 3D objects) by introducing a novel view-based approach able to handle the different types of multimedia data. More specifically, a set of 2D images (multi-views) are automatically generated from a 3D object, by taking views from uniformly distributed viewpoints. For each image, a set of 2D rotation-invariant shape descriptors is produced. The global shape similarity between two 3D models is achieved by applying a novel matching scheme, which effectively combines the information extracted from the multi-view representation. The experimental results prove that the proposed method demonstrates superior performance over other well-known state-of-the-art approaches.

Journal ArticleDOI
TL;DR: This work shows that learning the time delayed activity correlations offers important contextual information for (i) spatial and temporal topology inference of a camera network; (ii) robust person re-identification and (iii) global activity interpretation and video temporal segmentation.
Abstract: We propose a novel approach to understanding activities from their partial observations monitored through multiple non-overlapping cameras separated by unknown time gaps. In our approach, each camera view is first decomposed automatically into regions based on the correlation of object dynamics across different spatial locations in all camera views. A new Cross Canonical Correlation Analysis (xCCA) is then formulated to discover and quantify the time delayed correlations of regional activities observed within and across multiple camera views in a single common reference space. We show that learning the time delayed activity correlations offers important contextual information for (i) spatial and temporal topology inference of a camera network; (ii) robust person re-identification; and (iii) global activity interpretation and video temporal segmentation. Crucially, in contrast to conventional methods, our approach does not rely on either intra-camera or inter-camera object tracking; it thus can be applied to low-quality surveillance videos featuring severe inter-object occlusions. The effectiveness and robustness of our approach are demonstrated through experiments on 330 hours of videos captured from 17 cameras installed at two busy underground stations with complex and diverse scenes.
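
The underlying primitive is a correlation between two regional activity series maximized over an unknown time delay. A scalar sketch; the paper's xCCA generalizes this to multivariate regional activities via canonical correlation:

    import numpy as np

    def best_time_delay(a, b, max_lag=100):
        """Return the lag maximizing the normalized cross-correlation
        between 1D activity series `a` and `b` of equal length."""
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        lags = np.arange(-max_lag, max_lag + 1)
        scores = []
        for lag in lags:
            if lag >= 0:
                x, y = a[lag:], b[:len(b) - lag]
            else:
                x, y = a[:lag], b[-lag:]
            scores.append(np.mean(x * y))
        return lags[int(np.argmax(scores))]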

Journal ArticleDOI
TL;DR: A novel anthropometric three dimensional (Anthroface 3D) face recognition algorithm, which is based on a systematically selected set of discriminatory structural characteristics of the human face derived from the existing scientific literature on facial anthropometry, is presented.
Abstract: We present a novel anthropometric three dimensional (Anthroface 3D) face recognition algorithm, which is based on a systematically selected set of discriminatory structural characteristics of the human face derived from the existing scientific literature on facial anthropometry. We propose a novel technique for automatically detecting 10 anthropometric facial fiducial points that are associated with these discriminatory anthropometric features. We isolate and employ unique textural and/or structural characteristics of these fiducial points, along with the established anthropometric facial proportions of the human face, for detecting them. Lastly, we develop a completely automatic face recognition algorithm that employs facial 3D Euclidean and geodesic distances between these 10 automatically located anthropometric facial fiducial points and a linear discriminant classifier. On a database of 1149 facial images of 118 subjects, we show that the standard deviation of the Euclidean distance of each automatically detected fiducial point from its manually identified position is less than 2.54 mm. We further show that the proposed Anthroface 3D recognition algorithm performs well (equal error rate of 1.98% and a rank 1 recognition rate of 96.8%), outperforms three of the existing benchmark 3D face recognition algorithms, and is robust to the observed fiducial point localization errors.
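
A simplified version of the recognition pipeline: turn the detected fiducial points into a distance-based feature vector and classify with a linear discriminant. The sketch uses only Euclidean inter-landmark distances (the paper also uses geodesic distances along the facial surface), and the usage names are hypothetical:

    import numpy as np
    from itertools import combinations
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fiducial_distances(landmarks):
        """Pairwise Euclidean distances between 3D fiducial points
        (shape (n_points, 3)) as a feature vector."""
        return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                         for i, j in combinations(range(len(landmarks)), 2)])

    # Illustrative usage with hypothetical arrays `all_landmarks`, `subject_ids`:
    # X = np.stack([fiducial_distances(lm) for lm in all_landmarks])
    # clf = LinearDiscriminantAnalysis().fit(X, subject_ids)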

Journal ArticleDOI
TL;DR: An active basis model, a shared sketch algorithm, and a computational architecture of sum-max maps for representing, learning, and recognizing deformable templates are proposed.
Abstract: This article proposes an active basis model, a shared sketch algorithm, and a computational architecture of sum-max maps for representing, learning, and recognizing deformable templates. In our generative model, a deformable template is in the form of an active basis, which consists of a small number of Gabor wavelet elements at selected locations and orientations. These elements are allowed to slightly perturb their locations and orientations before they are linearly combined to generate the observed image. The active basis model, in particular, the locations and the orientations of the basis elements, can be learned from training images by the shared sketch algorithm. The algorithm selects the elements of the active basis sequentially from a dictionary of Gabor wavelets. When an element is selected at each step, the element is shared by all the training images, and the element is perturbed to encode or sketch a nearby edge segment in each training image. The recognition of the deformable template from an image can be accomplished by a computational architecture that alternates the sum maps and the max maps. The computation of the max maps deforms the active basis to match the image data, and the computation of the sum maps scores the template matching by the log-likelihood of the deformed active basis.
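
The sum-max recognition architecture is compact enough to sketch: SUM1 maps are the Gabor responses, MAX1 allows each element a small local shift, and SUM2 scores the template. A simplified version that perturbs locations but, for brevity, not orientations:

    import numpy as np
    from scipy.ndimage import maximum_filter

    def sum2_score(sum1_maps, elements, shift=3):
        """`sum1_maps`: filter responses indexed (orientation, y, x);
        `elements`: the active basis as (orientation, y, x, weight)
        tuples. MAX1 = local maximum of SUM1 over a (2*shift+1)^2
        window; SUM2 = weighted sum of MAX1 at the element positions."""
        max1 = np.stack([maximum_filter(m, size=2 * shift + 1)
                         for m in sum1_maps])
        return sum(w * max1[o, y, x] for o, y, x, w in elements)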

Journal ArticleDOI
TL;DR: This paper defines a similarity measure between two parts based not only on their local signatures and geometry, but also on their context within the shape to which they belong, and presents results on finding part analogies among numerous objects from shape repositories.
Abstract: In this paper we address the problem of finding analogies between parts of 3D objects. By partitioning an object into meaningful parts and finding analogous parts in other objects, not necessarily of the same type, many analysis and modeling tasks could be enhanced. For instance, partial match queries can be formulated, annotation of parts in objects can be utilized, and modeling-by-parts applications could be supported. We define a similarity measure between two parts based not only on their local signatures and geometry, but also on their context within the shape to which they belong. In our approach, all objects are hierarchically segmented (e.g. using the shape diameter function), and each part is given a local signature. However, to find corresponding parts in other objects we use a context enhanced part-in-whole matching. Our matching function is based on bi-partite graph matching and is computed using a flow algorithm which takes into account both local geometrical features and the partitioning hierarchy. We present results on finding part analogies among numerous objects from shape repositories, and demonstrate sub-part queries using an implementation of a simple search and retrieval application. We also demonstrate a simple annotation tool that carries textual tags of object parts from one model to many others using analogies, laying a basis for semantic text based search.
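
The matching core is bipartite assignment over part signatures. A sketch with a plain Euclidean cost; the paper's cost additionally encodes each part's context within the segmentation hierarchy, and its flow formulation allows partial matches:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_parts(sig_a, sig_b):
        """Match parts of object A to parts of object B by solving the
        assignment problem on a signature-distance cost matrix.
        Returns the matched index pairs and the total cost."""
        cost = np.linalg.norm(sig_a[:, None, :] - sig_b[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        return list(zip(rows, cols)), cost[rows, cols].sum()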

Journal ArticleDOI
TL;DR: This work introduces a method to hierarchically segment articulated shapes into meaningful parts and to register these parts across populations of near-isometric shapes (e.g. head, arms, legs and fingers of humans in different body postures).
Abstract: This work introduces a method to hierarchically segment articulated shapes into meaningful parts and to register these parts across populations of near-isometric shapes (e.g. head, arms, legs and fingers of humans in different body postures). The method exploits the isometry invariance of eigenfunctions of the Laplace-Beltrami operator and uses topological features (level sets at important saddles) for the segmentation. Concepts from persistent homology are employed for a hierarchical representation, for the elimination of topological noise and for the comparison of eigenfunctions. The obtained parts can be registered via their spectral embedding across a population of near isometric shapes. This work also presents the highly accurate computation of eigenfunctions and eigenvalues with cubic finite elements on triangle meshes and discusses the construction of persistence diagrams from the Morse-Smale complex as well as the relation to size functions.
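
As a rough stand-in for the paper's cubic-FEM discretization, the eigenfunctions can be illustrated with a combinatorial graph Laplacian built from the mesh edges; the first nontrivial eigenfunction is the one whose level sets drive the segmentation:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    def graph_laplacian_eigenfunctions(n_vertices, edges, k=10):
        """Smallest k eigenpairs of the unweighted graph Laplacian of a
        mesh given as an edge list (a coarse approximation of the
        Laplace-Beltrami spectrum)."""
        i, j = np.asarray(edges, dtype=int).T
        W = sp.coo_matrix((np.ones(len(i)), (i, j)),
                          shape=(n_vertices, n_vertices))
        W = ((W + W.T) > 0).astype(float)
        deg = np.asarray(W.sum(axis=1)).ravel()
        L = sp.diags(deg) - W
        # shift-invert around a small negative value to get the low end
        evals, evecs = eigsh(L.tocsc(), k=k, sigma=-1e-5, which='LM')
        return evals, evecs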

Journal ArticleDOI
TL;DR: This work presents a practicable and expandable probabilistic framework for parts-based object class representation, enabling the detection of rigid and articulated object classes in arbitrary views; it investigates learning of this representation from labelled training images and infers globally optimal solutions to the contextual MAP-detection problem.
Abstract: Object detection is one of the key components in modern computer vision systems. While the detection of a specific rigid object under changing viewpoints was considered hard just a few years ago, current research strives to detect and recognize classes of non-rigid, articulated objects. Hampered by the omnipresent confusing information due to clutter and occlusion, the focus has shifted from holistic approaches for object detection to representations of individual object parts linked by structural information, along with richer contextual descriptions of object configurations. Along this line of research, we present a practicable and expandable probabilistic framework for parts-based object class representation, enabling the detection of rigid and articulated object classes in arbitrary views. We investigate learning of this representation from labelled training images and infer globally optimal solutions to the contextual MAP-detection problem, using A*-search with a novel lower bound as an admissible heuristic. An assessment of the inference performance of Belief Propagation and Tree-Reweighted Belief Propagation is obtained as a by-product. The generality of our approach is demonstrated on four different datasets utilizing domain-dependent information cues.
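
The MAP-detection inference relies on A*-search with an admissible (lower-bound) heuristic. A generic A* sketch showing why admissibility matters — the first goal popped is guaranteed optimal. States must be hashable, and the paper's novel bound would play the role of `heuristic`:

    import heapq
    from itertools import count

    def astar(start, is_goal, successors, heuristic):
        """Generic A*: `successors(s)` yields (next_state, step_cost);
        `heuristic(s)` must never overestimate the remaining cost.
        Returns (goal_state, cost) or (None, inf)."""
        tie = count()                      # breaks ties between equal f-values
        frontier = [(heuristic(start), next(tie), 0.0, start)]
        best = {start: 0.0}
        while frontier:
            _, _, g, state = heapq.heappop(frontier)
            if is_goal(state):
                return state, g
            for nxt, cost in successors(state):
                ng = g + cost
                if ng < best.get(nxt, float('inf')):
                    best[nxt] = ng
                    heapq.heappush(frontier,
                                   (ng + heuristic(nxt), next(tie), ng, nxt))
        return None, float('inf')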

Journal ArticleDOI
TL;DR: A multi-task learning algorithm is proposed to learn the task-related “stimulus-saliency” mapping functions for each scene, along with various fusion strategies that integrate the stimulus-driven and task-related components to obtain the visual saliency.
Abstract: In this paper, we present a probabilistic multi-task learning approach for visual saliency estimation in video. In our approach, the problem of visual saliency estimation is modeled by simultaneously considering the stimulus-driven and task-related factors in a probabilistic framework. In this framework, a stimulus-driven component simulates the low-level processes in the human visual system using multi-scale wavelet decomposition and unbiased feature competition, while a task-related component simulates the high-level processes that bias the competition of the input features. In contrast to existing approaches, we propose a multi-task learning algorithm to learn the task-related "stimulus-saliency" mapping functions for each scene. The algorithm also learns various fusion strategies, which are used to integrate the stimulus-driven and task-related components to obtain the visual saliency. Extensive experiments were carried out on two public eye-fixation datasets and one regional saliency dataset. Experimental results show that our approach considerably outperforms eight state-of-the-art approaches.

Journal ArticleDOI
TL;DR: A new texture analysis scheme is introduced, which is invariant to local geometric and radiometric changes, and the obtained experimental results outperform the current state of the art in locally invariant texture analysis.
Abstract: This paper introduces a new texture analysis scheme, which is invariant to local geometric and radiometric changes. The proposed methodology relies on the topographic map of images, obtained from the connected components of level sets. This morphological tool, providing a multi-scale and contrast-invariant representation of images, is shown to be well suited to texture analysis. We first make use of invariant moments to extract geometrical information from the topographic map. This yields features that are invariant to local similarities or local affine transformations. These features are invariant to any local contrast change. We then relax this invariance by computing additional features that are invariant to local affine contrast changes and investigate the resulting analysis scheme by performing classification and retrieval experiments on three texture databases. The obtained experimental results outperform the current state of the art in locally invariant texture analysis.
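
The geometric features are invariant moments of the connected components of level sets. As a concrete example, the first two Hu-style rotation invariants of one component, computed from scale-normalized central moments; the paper uses a richer, locally affine-invariant set:

    import numpy as np

    def rotation_invariant_moments(mask):
        """First two Hu invariants of a binary component `mask`,
        built from central moments normalized for scale."""
        ys, xs = np.nonzero(mask)
        m00 = float(len(xs))
        x, y = xs - xs.mean(), ys - ys.mean()
        def eta(p, q):
            return (x ** p * y ** q).sum() / m00 ** (1 + (p + q) / 2.0)
        i1 = eta(2, 0) + eta(0, 2)
        i2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
        return np.array([i1, i2])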

Journal ArticleDOI
TL;DR: This work proposes novel photometric and superpixel boundary consistency terms explicitly derived from superpixels, and shows that they overcome many difficulties of standard pixel-based formulations and favorably handle problematic scenarios containing many repetitive structures and untextured or weakly textured regions.
Abstract: Urban environments possess many regularities which can be efficiently exploited for 3D dense reconstruction from multiple widely separated views. We present an approach utilizing properties of piecewise planarity and a restricted number of plane orientations to suppress reconstruction and matching ambiguities that cause failures of standard dense stereo methods. We formulate the problem of 3D reconstruction in an MRF framework built on an image pre-segmented into superpixels. Using this representation, we propose novel photometric and superpixel boundary consistency terms explicitly derived from superpixels and show that they overcome many difficulties of standard pixel-based formulations and favorably handle problematic scenarios containing many repetitive structures and untextured or weakly textured regions. We demonstrate our approach on several wide-baseline scenes, showing superior performance compared to previously proposed methods.