Showing papers in "International Journal of Computer Vision in 2014"


Journal ArticleDOI
Xudong Cao1, Yichen Wei1, Fang Wen1, Jian Sun1
TL;DR: A very efficient, highly accurate, “Explicit Shape Regression” approach for face alignment that significantly outperforms the state-of-the-art in terms of both accuracy and efficiency.
Abstract: We present a very efficient, highly accurate, "Explicit Shape Regression" approach for face alignment. Unlike previous regression-based approaches, we directly learn a vectorial regression function to infer the whole facial shape (a set of facial landmarks) from the image and explicitly minimize the alignment errors over the training data. The inherent shape constraint is naturally encoded into the regressor in a cascaded learning framework and applied from coarse to fine during testing, without using a fixed parametric shape model as in most previous methods. To make the regression more effective and efficient, we design a two-level boosted regression, shape-indexed features and a correlation-based feature selection method. This combination enables us to learn accurate models from large training data in a short time (20 min for 2,000 training images), and run the regression extremely fast at test time (15 ms for an 87-landmark shape). Experiments on challenging data show that our approach significantly outperforms the state-of-the-art in terms of both accuracy and efficiency.

1,239 citations
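
To make the cascade concrete, here is a minimal Python sketch of the test-time inference loop under stated assumptions: `stages` is a list of pre-trained stage regressors and `extract_features` computes shape-indexed features; both names are illustrative, not the authors' code.

```python
import numpy as np

def align(image, initial_shape, stages, extract_features):
    """Coarse-to-fine cascaded shape regression (schematic)."""
    shape = initial_shape.copy()               # (n_landmarks, 2) array
    for regressor in stages:
        # Features are indexed relative to the current shape estimate;
        # this is what encodes the shape constraint implicitly.
        feats = extract_features(image, shape)
        shape = shape + regressor.predict(feats).reshape(shape.shape)
    return shape
```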


Journal ArticleDOI
TL;DR: It is discovered that “classical” flow formulations perform surprisingly well when combined with modern optimization and implementation techniques, and a new objective function is derived that formalizes the median filtering heuristic, leading to a method that better preserves motion details.
Abstract: The accuracy of optical flow estimation algorithms has been improving steadily as evidenced by results on the Middlebury optical flow benchmark. The typical formulation, however, has changed little since the work of Horn and Schunck. We attempt to uncover what has made recent advances possible through a thorough analysis of how the objective function, the optimization method, and modern implementation practices influence accuracy. We discover that "classical" flow formulations perform surprisingly well when combined with modern optimization and implementation techniques. One key implementation detail is the median filtering of intermediate flow fields during optimization. While this improves the robustness of classical methods it actually leads to higher energy solutions, meaning that these methods are not optimizing the original objective function. To understand the principles behind this phenomenon, we derive a new objective function that formalizes the median filtering heuristic. This objective function includes a non-local smoothness term that robustly integrates flow estimates over large spatial neighborhoods. By modifying this new term to include information about flow and image boundaries we develop a method that can better preserve motion details. To take advantage of the trend towards video in wide-screen format, we further introduce an asymmetric pyramid downsampling scheme that enables the estimation of longer range horizontal motions. The methods are evaluated on the Middlebury, MPI Sintel, and KITTI datasets using the same parameter settings.

623 citations
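
The key implementation detail singled out above is easy to reproduce. A minimal sketch of median filtering the intermediate flow fields between warping iterations, assuming SciPy (the window size is illustrative):

```python
from scipy.ndimage import median_filter

def smooth_flow(u, v, size=5):
    """Median-filter both flow components between warping iterations."""
    return median_filter(u, size=size), median_filter(v, size=size)
```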


Journal ArticleDOI
TL;DR: This paper starts with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporates a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts.
Abstract: This paper investigates the problem of modeling Internet images and associated text or tags for tasks such as image-to-image search, tag-to-image search, and image-to-tag search (image annotation). We start with canonical correlation analysis (CCA), a popular and successful approach for mapping visual and textual features to the same latent space, and incorporate a third view capturing high-level image semantics, represented either by a single category or multiple non-mutually-exclusive concepts. We present two ways to train the three-view embedding: supervised, with the third view coming from ground-truth labels or search keywords; and unsupervised, with semantic themes automatically obtained by clustering the tags. To ensure high accuracy for retrieval tasks while keeping the learning process scalable, we combine multiple strong visual features and use explicit nonlinear kernel mappings to efficiently approximate kernel CCA. To perform retrieval, we use a specially designed similarity function in the embedded space, which substantially outperforms the Euclidean distance. The resulting system produces compelling qualitative results and outperforms a number of two-view baselines on retrieval tasks on three large-scale Internet image datasets.

612 citations
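
As a point of reference, a plain two-view CCA baseline with cosine ranking in the embedded space fits in a few lines of scikit-learn. The toy data and dimensionalities below are made up, and the paper's third semantic view, explicit kernel mappings, and specially weighted similarity function are not shown:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_vis = rng.normal(size=(200, 128))      # toy visual features
X_txt = rng.normal(size=(200, 64))       # toy tag features

cca = CCA(n_components=16).fit(X_vis, X_txt)
Z_vis, Z_txt = cca.transform(X_vis, X_txt)

def rank_by_cosine(query, database):
    q = query / np.linalg.norm(query)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    return np.argsort(-(d @ q))          # best matches first

hits = rank_by_cosine(Z_txt[0], Z_vis)   # tag-to-image search
```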


Journal ArticleDOI
TL;DR: The proposed FDDL model is extensively evaluated on various image datasets, and it shows superior performance to many state-of-the-art dictionary learning methods in a variety of classification tasks.
Abstract: The employed dictionary plays an important role in sparse representation or sparse coding based image reconstruction and classification, while learning dictionaries from the training data has led to state-of-the-art results in image classification tasks. However, many dictionary learning models exploit only the discriminative information in either the representation coefficients or the representation residual, which limits their performance. In this paper we present a novel dictionary learning method based on the Fisher discrimination criterion. A structured dictionary, whose atoms have correspondences to the subject class labels, is learned, with which not only the representation residual can be used to distinguish different classes, but also the representation coefficients have small within-class scatter and big between-class scatter. The classification scheme associated with the proposed Fisher discrimination dictionary learning (FDDL) model is consequently presented by exploiting the discriminative information in both the representation residual and the representation coefficients. The proposed FDDL model is extensively evaluated on various image datasets, and it shows superior performance to many state-of-the-art dictionary learning methods in a variety of classification tasks.

474 citations
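
In schematic form (a hedged reconstruction from the description above, not copied from the paper), the FDDL objective couples a class-structured reconstruction term with a Fisher criterion on the coding coefficients:

$$
\min_{D,\,X}\; r(A, D, X) + \lambda_1 \lVert X \rVert_1 + \lambda_2\big(\operatorname{tr}(S_W(X)) - \operatorname{tr}(S_B(X)) + \eta \lVert X \rVert_F^2\big),
$$

where $r(A,D,X)$ asks each class sub-dictionary $D_i$ to reconstruct its own class well and other classes poorly, $S_W$ and $S_B$ are the within- and between-class scatter matrices of the coefficients, and the $\eta$ term keeps the criterion well behaved.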


Journal ArticleDOI
TL;DR: It is shown that when used as features for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing natural images, low-dimensional scene attributes can compete with or improve on state-of-the-art performance.
Abstract: In this paper we present the first large-scale scene attribute database. First, we perform crowdsourced human studies to find a taxonomy of 102 discriminative attributes. We discover attributes related to materials, surface properties, lighting, affordances, and spatial layout. Next, we build the "SUN attribute database" on top of the diverse SUN categorical database. We use crowdsourcing to annotate attributes for 14,340 images from 707 scene categories. We perform numerous experiments to study the interplay between scene attributes and scene categories. We train and evaluate attribute classifiers and then study the feasibility of attributes as an intermediate scene representation for scene classification, zero-shot learning, automatic image captioning, semantic image search, and parsing natural images. We show that when used as features for these tasks, low-dimensional scene attributes can compete with or improve on state-of-the-art performance. The experiments suggest that scene attributes are an effective low-dimensional feature for capturing high-level context and semantics in scenes.

395 citations
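
The "attributes as an intermediate representation" experiments amount to stacking attribute-classifier confidences into a short descriptor. A hedged sketch, where `attribute_clfs` stands for 102 hypothetical pre-trained binary classifiers (not the released models):

```python
import numpy as np
from sklearn.svm import LinearSVC

def attribute_descriptor(low_level_feats, attribute_clfs):
    """One confidence per attribute -> a 102-dim descriptor per image."""
    return np.column_stack([clf.decision_function(low_level_feats)
                            for clf in attribute_clfs])

# Scene classification then trains directly on the short descriptor, e.g.
# scene_clf = LinearSVC().fit(attribute_descriptor(X_tr, attribute_clfs), y_tr)
```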


Journal ArticleDOI
TL;DR: This paper comprehensively surveys the development of face hallucination, including both face super-resolution and face sketch-photo synthesis techniques, and presents a comparative analysis of representative methods and promising future directions.
Abstract: This paper comprehensively surveys the development of face hallucination (FH), including both face super-resolution and face sketch-photo synthesis techniques. Indeed, these two techniques share the same objective of inferring a target face image (e.g. high-resolution face image, face sketch and face photo) from a corresponding source input (e.g. low-resolution face image, face photo and face sketch). Considering the critical role of image interpretation in modern intelligent systems for authentication, surveillance, law enforcement, security control, and entertainment, FH has attracted growing attention in recent years. Existing FH methods can be grouped into four categories: Bayesian inference approaches, subspace learning approaches, a combination of Bayesian inference and subspace learning approaches, and sparse representation-based approaches. Despite this progress, the success of FH remains limited by complex application conditions such as varying illumination, pose, and viewpoint. This paper provides a holistic understanding and deep insight into FH, and presents a comparative analysis of representative methods and promising future directions.

365 citations


Journal ArticleDOI
TL;DR: This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection, based on Structured Output SVM, but extends it to accommodate sequential data.
Abstract: The need for early detection of temporal events from sequential data arises in a wide spectrum of applications ranging from human-robot interaction to video security. While temporal event detection has been extensively studied, early detection is a relatively unexplored problem. This paper proposes a maximum-margin framework for training temporal event detectors to recognize partial events, enabling early detection. Our method is based on Structured Output SVM, but extends it to accommodate sequential data. Experiments on datasets of varying complexity, for detecting facial expressions, hand gestures, and human activities, demonstrate the benefits of our approach.

360 citations
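
A simplified, hedged statement of the partial-event idea (notation mine, not the paper's exact formulation): for a training event $y=[s,e]$ and every time $t \in [s,e]$, the detector score of the partial event must beat every competing segment seen so far by a margin that grows with the fraction of the event observed,

$$
f\big(X_{[s,t]}\big) \;\ge\; f\big(X_{[a,b]}\big) + \mu\!\Big(\tfrac{t-s+1}{e-s+1}\Big)\,\Delta\big([s,e],[a,b]\big) - \xi \qquad \forall\,[a,b]\subseteq[1,t],
$$

where $\Delta$ is a segment-overlap loss, $\mu(\cdot)$ is a non-decreasing scaling function, and $\xi \ge 0$ is a slack variable; training minimizes the usual regularized sum of slacks.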


Journal ArticleDOI
TL;DR: This paper proposes a simple “prior-free” method for solving the non-rigid structure-from-motion (NRSfM) factorization problem, which involves solving a very small SDP of fixed size, and a nuclear-norm minimization problem.
Abstract: This paper proposes a simple "prior-free" method for solving the non-rigid structure-from-motion (NRSfM) factorization problem. Other than using the fundamental low-order linear combination model assumption, our method does not assume any extra prior knowledge either about the non-rigid structure or about the camera motions. Yet, it works effectively and reliably, producing optimal results, and not suffering from the inherent basis ambiguity issue which plagued most conventional NRSfM factorization methods. Our method is very simple to implement, which involves solving a very small SDP (semi-definite programming) of fixed size, and a nuclear-norm minimization problem. We also present theoretical analysis on the uniqueness and the relaxation gap of our solutions. Extensive experiments on both synthetic and real motion capture data (assumed to follow the low-order linear combination model) are conducted, which demonstrate that our method indeed outperforms most of the existing non-rigid factorization methods. This work offers not only new theoretical insight, but also a practical, everyday solution to NRSfM.

317 citations
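
A toy sketch of the nuclear-norm step with CVXPY, under heavy simplification: the rotations are assumed already recovered, the data are random, and the paper actually minimizes the nuclear norm of a reshuffled shape matrix rather than of $S$ directly:

```python
import numpy as np
import cvxpy as cp

F, P = 4, 10                            # frames, points (toy sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(2 * F, P))         # stacked 2D point tracks
R = rng.normal(size=(2 * F, 3 * F))     # block-diagonal cameras in practice

S = cp.Variable((3 * F, P))             # stacked non-rigid 3D shapes
prob = cp.Problem(cp.Minimize(cp.normNuc(S)), [W == R @ S])
prob.solve()
print(prob.status, np.linalg.matrix_rank(S.value, tol=1e-6))
```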


Journal ArticleDOI
TL;DR: Extensive experiments on synthetic data and on important computer vision problems such as face recognition and visual domain adaptation for object recognition demonstrate the superiority of the proposed approach over existing, well-established methods.
Abstract: It is expensive to obtain labeled real-world visual data for use in training of supervised algorithms. Therefore, it is valuable to leverage existing databases of labeled data. However, the data in the source databases is often obtained under conditions that differ from those in the new task. Transfer learning provides techniques for transferring learned knowledge from a source domain to a target domain by finding a mapping between them. In this paper, we discuss a method for projecting both source and target data to a generalized subspace where each target sample can be represented by some combination of source samples. By employing a low-rank constraint during this transfer, the structures of the source and target domains are preserved. This approach has three benefits. First, good alignment between the domains is ensured through the use of only relevant data in some subspace of the source domain in reconstructing the data in the target domain. Second, the discriminative power of the source domain is naturally passed on to the target domain. Third, noisy information will be filtered out during knowledge transfer. Extensive experiments on synthetic data and on important computer vision problems such as face recognition and visual domain adaptation for object recognition demonstrate the superiority of the proposed approach over existing, well-established methods.

293 citations
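
Schematically (my hedged reconstruction, not the paper's exact notation), the low-rank transfer constraint reads

$$
\min_{P,\,Z,\,E}\; F(P, X_S) + \operatorname{rank}(Z) + \lambda \lVert E \rVert_{2,1} \quad \text{s.t.} \quad P X_T = P X_S Z + E,
$$

where $P$ projects both domains into the shared subspace, $Z$ expresses each projected target sample as a combination of projected source samples, the low rank of $Z$ preserves domain structure, and the column-sparse error $E$ absorbs noise; in practice the rank is relaxed to the nuclear norm.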


Journal ArticleDOI
TL;DR: This work presents an approach for live learning of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web, and introduces a novel part-based detector amenable to linear classifiers.
Abstract: Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far techniques are tested in artificially controlled settings. Typically the vision researcher has already determined the dataset's scope, the labels "actively" obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for live learning of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL VOC benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

273 citations


Journal ArticleDOI
TL;DR: A weakly-supervised cross-domain dictionary learning method is introduced, which learns a reconstructive, discriminative and domain-adaptive dictionary pair and the corresponding classifier parameters without using any prior information.
Abstract: We address the visual categorization problem and present a method that utilizes weakly labeled data from other visual domains as the auxiliary source data for enhancing the original learning system. The proposed method aims to expand the intra-class diversity of original training data through the collaboration with the source data. In order to bring the original target domain data and the auxiliary source domain data into the same feature space, we introduce a weakly-supervised cross-domain dictionary learning method, which learns a reconstructive, discriminative and domain-adaptive dictionary pair and the corresponding classifier parameters without using any prior information. Such a method operates at a high level, and it can be applied to different cross-domain applications. To build up the auxiliary domain data, we manually collect images from Web pages, and select human actions of specific categories from a different dataset. The proposed method is evaluated for human action recognition, image classification and event recognition tasks on the UCF YouTube dataset, the Caltech101/256 datasets and the Kodak dataset, respectively, achieving outstanding results.

Journal ArticleDOI
TL;DR: A 3D reconstruction and visualization system for automatically producing clean and well-regularized texture-mapped 3D models for large indoor scenes, from ground-level photographs and 3D laser points, with a new algorithm called “inverse constructive solid geometry (CSG)” for reconstructing a scene with a CSG representation consisting of volumetric primitives.
Abstract: Virtual exploration tools for large indoor environments (e.g. museums) have so far been limited to either blueprint-style 2D maps that lack photo-realistic views of scenes, or ground-level image-to-image transitions, which are immersive but ill-suited for navigation. On the other hand, photorealistic aerial maps would be a useful navigational guide for large indoor environments, but it is impossible to directly acquire photographs covering a large indoor environment from aerial viewpoints. This paper presents a 3D reconstruction and visualization system for automatically producing clean and well-regularized texture-mapped 3D models for large indoor scenes, from ground-level photographs and 3D laser points. The key component is a new algorithm called "inverse constructive solid geometry (CSG)" for reconstructing a scene with a CSG representation consisting of volumetric primitives, which imposes powerful regularization constraints. We also propose several novel techniques to adjust the 3D model to make it suitable for rendering the 3D maps from aerial viewpoints. The visualization system enables users to easily browse a large-scale indoor environment from a bird's-eye view, locate specific room interiors, fly into a place of interest, view immersive ground-level panorama views, and zoom out again, all with seamless 3D transitions. We demonstrate our system on various museums, including the Metropolitan Museum of Art in New York City, one of the largest art galleries in the world.

Journal ArticleDOI
TL;DR: This work observes that typical occlusions are due to overlaps between people and proposes a people detector tailored to various occlusion levels, leveraging the fact that person/person occlusions result in very characteristic appearance patterns that can help to improve detection results.
Abstract: We consider the problem of detection and tracking of multiple people in crowded street scenes. State-of-the-art methods perform well in scenes with relatively few people, but are severely challenged by scenes with many subjects that partially occlude each other. This limitation is due to the fact that current people detectors fail when persons are strongly occluded. We observe that typical occlusions are due to overlaps between people and propose a people detector tailored to various occlusion levels. Instead of treating partial occlusions as distractions, we leverage the fact that person/person occlusions result in very characteristic appearance patterns that can help to improve detection results. We demonstrate the performance of our occlusion-aware person detector on a new dataset of people with controlled but severe levels of occlusion and on two challenging publicly available benchmarks outperforming single person detectors in each case.

Journal ArticleDOI
TL;DR: The SBM characterizes a strong model of shape, in that samples from the model look realistic and it can generalize to generate samples that differ from training examples, and it finds that the SBM learns distributions that are qualitatively and quantitatively better than existing models for this task.
Abstract: A good model of object shape is essential in applications such as segmentation, detection, inpainting and graphics. For example, when performing segmentation, local constraints on the shapes can help where object boundaries are noisy or unclear, and global constraints can resolve ambiguities where background clutter looks similar to parts of the objects. In general, the stronger the model of shape, the more performance is improved. In this paper, we use a type of deep Boltzmann machine (Salakhutdinov and Hinton, International Conference on Artificial Intelligence and Statistics, 2009) that we call a Shape Boltzmann Machine (SBM) for the task of modeling foreground/background (binary) and parts-based (categorical) shape images. We show that the SBM characterizes a strong model of shape, in that samples from the model look realistic and it can generalize to generate samples that differ from training examples. We find that the SBM learns distributions that are qualitatively and quantitatively better than existing models for this task.
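
For context, the underlying deep Boltzmann machine assigns a binary shape image $\mathbf{v}$ and two hidden layers the standard energy

$$
E(\mathbf{v}, \mathbf{h}^1, \mathbf{h}^2) = -\mathbf{v}^\top W^1 \mathbf{h}^1 - (\mathbf{h}^1)^\top W^2 \mathbf{h}^2 - \mathbf{b}^\top \mathbf{v} - (\mathbf{c}^1)^\top \mathbf{h}^1 - (\mathbf{c}^2)^\top \mathbf{h}^2 ;
$$

per the abstract, what makes the SBM a shape model is how it constrains this architecture for shape images (e.g. restrictions on $W^1$), not the energy form itself, which is standard.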

Journal ArticleDOI
TL;DR: An effective method to expose region splicing by revealing inconsistencies in local noise levels, based on the fact that images of different origins may have different noise characteristics introduced by the sensors or post-processing steps.
Abstract: Region splicing is a simple and common digital image tampering operation, where a chosen region from one image is composited into another image with the aim to modify the original image's content. In this paper, we describe an effective method to expose region splicing by revealing inconsistencies in local noise levels, based on the fact that images of different origins may have different noise characteristics introduced by the sensors or post-processing steps. The basis of our region splicing detection method is a new blind noise estimation algorithm, which exploits a particular regular property of the kurtosis of natural images in band-pass domains and the relationship between noise characteristics and kurtosis. The estimation of noise statistics is formulated as an optimization problem with a closed-form solution, and is further extended to an efficient estimation method of local noise statistics. We demonstrate the efficacy of our blind global and local noise estimation methods on natural images, and evaluate the performance and robustness of the region splicing detection method on forged images.
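
The kurtosis-noise relationship the estimator exploits can be stated compactly (hedged reconstruction): if a band-pass response with variance $\sigma^2$ and excess kurtosis $\kappa$ is corrupted by independent additive Gaussian noise of variance $\sigma_n^2$ (which has zero excess kurtosis), the observed kurtosis shrinks to

$$
\tilde{\kappa} = \kappa \left( \frac{\sigma^2}{\sigma^2 + \sigma_n^2} \right)^{2}.
$$

Because $\kappa$ is approximately constant across band-pass channels of natural images, the measured $\tilde{\kappa}$ and observed variances across several channels over-determine $\sigma_n^2$, which is what admits the closed-form estimate.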

Journal ArticleDOI
TL;DR: In the case where all nodes share a common state space and the smoothness prior favours equal values, it is shown that unifying the two approaches yields a new algorithm, PMBP, which is more accurate than PM and orders of magnitude faster than MP-PBP.
Abstract: PatchMatch (PM) is a simple, yet very powerful and successful method for optimizing continuous labelling problems. The algorithm has two main ingredients: the update of the solution space by sampling and the use of the spatial neighbourhood to propagate samples. We show how these ingredients are related to steps in a specific form of belief propagation (BP) in the continuous space, called max-product particle BP (MP-PBP). However, MP-PBP has thus far been too slow to allow complex state spaces. In the case where all nodes share a common state space and the smoothness prior favours equal values, we show that unifying the two approaches yields a new algorithm, PMBP, which is more accurate than PM and orders of magnitude faster than MP-PBP. To illustrate the benefits of our PMBP method we have built a new stereo matching algorithm with unary terms which are borrowed from the recent PM Stereo work and novel realistic pairwise terms that provide smoothness. We have experimentally verified that our method is an improvement over state-of-the-art techniques at sub-pixel accuracy level.
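
A schematic Python sketch of one PMBP-style sweep, keeping K particles per node; `energy` stands in for the unary-plus-message cost and all names are illustrative (control flow only, not the authors' implementation):

```python
import numpy as np

def pmbp_sweep(particles, energy, neighbors, K, rng, scale=1.0):
    """One sweep; particles: dict node -> list of K state arrays."""
    for node in sorted(particles):
        cands = list(particles[node])
        for nb in neighbors(node):              # PatchMatch propagation:
            cands += list(particles[nb])        # borrow neighbours' states
        cands += [p + rng.normal(scale=scale, size=p.shape)
                  for p in particles[node]]     # random-search resampling
        cands.sort(key=lambda p: energy(node, p))
        particles[node] = cands[:K]             # keep the K best particles
    return particles
```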

Journal ArticleDOI
TL;DR: The objective of this work is to determine if people are interacting in TV video by detecting whether they are looking at each other or not and to determine both the temporal period of the interaction and also spatially localize the relevant people.
Abstract: The objective of this work is to determine if people are interacting in TV video by detecting whether they are looking at each other or not. We determine both the temporal period of the interaction and also spatially localize the relevant people. We make the following four contributions: (i) head detection with implicit coarse pose information (front, profile, back); (ii) continuous head pose estimation in unconstrained scenarios (TV video) using Gaussian process regression; (iii) propose and evaluate several methods for assessing whether and when pairs of people are looking at each other in a video shot; and (iv) introduce new ground truth annotation for this task, extending the TV human interactions dataset (Patron-Perez et al. 2010). The performance of the methods is evaluated on this dataset, which consists of 300 video clips extracted from TV shows. Despite the variety and difficulty of this video material, our best method obtains an average precision of 87.6% in a fully automatic manner.
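
Contribution (ii), continuous head pose via Gaussian process regression, has an off-the-shelf analogue in scikit-learn. A toy sketch with random descriptors standing in for head-image features (not the paper's features or kernel):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))          # toy head descriptors
yaw = rng.uniform(-90, 90, size=100)    # yaw angles in degrees

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, yaw)
mean, std = gp.predict(X[:5], return_std=True)  # pose with uncertainty
```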

Journal ArticleDOI
TL;DR: It is demonstrated that observing people performing different actions can significantly improve estimates of 3D scene geometry and is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet.
Abstract: We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.

Journal ArticleDOI
TL;DR: This paper analyzes the novel concept of object bank, a high-level image representation encoding object appearance and spatial location information in images, and demonstrates that object bank is a high-level representation from which semantic information about unknown images can easily be discovered.
Abstract: Images are fundamentally related to the objects constituting them. In this paper, we propose to represent images by using objects appearing in them. We introduce the novel concept of object bank (OB), a high-level image representation encoding object appearance and spatial location information in images. OB represents an image based on its response to a large number of pre-trained object detectors, or `object filters', blind to the testing dataset and visual recognition task. Our OB representation demonstrates promising potential in high-level image recognition tasks. It significantly outperforms traditional low-level image representations in image classification on various benchmark image datasets by using simple, off-the-shelf classification algorithms such as linear SVM and logistic regression. In this paper, we analyze OB in detail, explaining our design choice of OB for achieving its best potential on different types of datasets. We demonstrate that object bank is a high-level representation from which we can easily discover semantic information about unknown images. We provide guidelines for effectively applying OB to high-level image recognition tasks, where it can be easily compressed for efficient computation in practice and is robust across various classifiers.
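
A hedged sketch of an OB-style descriptor, max-pooling each detector's response map over a coarse spatial grid; `detectors` is a hypothetical list of functions returning 2D response maps (the released OB uses many detectors at multiple scales):

```python
import numpy as np

def object_bank(image, detectors, grid=(2, 2)):
    feats = []
    gh, gw = grid
    for det in detectors:
        resp = det(image)                        # 2D response map
        H, W = resp.shape
        for i in range(gh):
            for j in range(gw):
                cell = resp[i*H//gh:(i+1)*H//gh, j*W//gw:(j+1)*W//gw]
                feats.append(cell.max())         # max-pool per grid cell
    return np.asarray(feats)                     # feed to a linear SVM
```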

Journal ArticleDOI
TL;DR: This paper presents a method to build rotation-invariant HOG descriptors using Fourier analysis in polar/spherical coordinates, which are closely related to the irreducible representation of the 2D/3D rotation groups.
Abstract: The histogram of oriented gradients (HOG) is widely used for image description and proves to be very effective. In many vision problems, rotation-invariant analysis is necessary or preferred. Popular solutions are mainly based on pose normalization or learning, neglecting some intrinsic properties of rotations. This paper presents a method to build rotation-invariant HOG descriptors using Fourier analysis in polar/spherical coordinates, which are closely related to the irreducible representation of the 2D/3D rotation groups. This is achieved by considering a gradient histogram as a continuous angular signal which can be well represented by the Fourier basis (2D) or spherical harmonics (3D). As rotation-invariance is established in an analytical way, we can avoid discretization artifacts and create a continuous mapping from the image to the feature space. In the experiments, we first show that our method outperforms the state-of-the-art in a public dataset for a car detection task in aerial images. We further use the Princeton Shape Benchmark and the SHREC 2009 Generic Shape Benchmark to demonstrate the high performance of our method for similarity measures of 3D shapes. Finally, we show an application on microscopic volumetric data.
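
The 2D core of the idea fits in a few lines: a gradient-orientation histogram is a periodic angular signal, and the magnitudes of its Fourier coefficients are unchanged by cyclic rotation. The paper goes well beyond this (continuous mappings, 3D spherical harmonics); the snippet only demonstrates the invariance principle:

```python
import numpy as np

def rotation_invariant(hist):
    """hist: orientation histogram over [0, 2*pi)."""
    return np.abs(np.fft.rfft(hist))    # angular shift changes phase only

h = np.random.rand(36)                  # 36 orientation bins
assert np.allclose(rotation_invariant(h),
                   rotation_invariant(np.roll(h, 7)))  # rotated copy
```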

Journal ArticleDOI
TL;DR: This paper investigates the visual-inertial structure from motion problem and shows that the problem can have a unique solution, two distinct solutions, or infinitely many solutions, depending on the trajectory, the number of point-features and their layout, and the number of camera images.
Abstract: This paper investigates the visual-inertial structure from motion problem. A simple closed-form solution to this problem is introduced. Special attention is devoted to identifying the conditions under which the problem has a finite number of solutions. Specifically, it is shown that the problem can have a unique solution, two distinct solutions, or infinitely many solutions, depending on the trajectory, the number of point-features and their layout, and the number of camera images. The investigation is also performed in the case when the inertial data are biased, showing that, in this latter case, more images and more restrictive conditions on the trajectory are required for the problem to be resolvable.

Journal ArticleDOI
TL;DR: A generic junction analysis scheme that yields a characterization of L-, Y- and X-junctions, including a precise computation of their type, localization and scale, using a statistical modeling of suitably normalized gray-level gradients.
Abstract: Accurate junction detection and characterization are of primary importance for several aspects of scene analysis, including depth recovery and motion analysis. In this work, we introduce a generic junction analysis scheme. The first asset of the proposed procedure is an automatic criterion for the detection of junctions, making it possible to deal with textured parts in which no detections are expected. Second, the method yields a characterization of L-, Y- and X-junctions, including a precise computation of their type, localization and scale. Contrary to classical approaches, scale characterization does not rely on the linear scale-space. First, an a contrario approach is used to compute the meaningfulness of a junction. This approach relies on a statistical modeling of suitably normalized gray-level gradients. Then, exclusion principles between junctions permit their precise characterization. We give implementation details for this procedure and evaluate its efficiency through various experiments.
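
The a contrario criterion takes the usual form (hedged paraphrase): a candidate junction $j$ is declared $\varepsilon$-meaningful when its number of false alarms

$$
\mathrm{NFA}(j) = N_{\text{tests}} \cdot \mathbb{P}\big[\text{a junction this strong arises under the noise model}\big] \;\le\; \varepsilon,
$$

where the probability is computed from the statistical model of normalized gray-level gradients and $N_{\text{tests}}$ counts the tested positions, scales and branch configurations; this is what lets the detector stay silent in pure texture.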

Journal ArticleDOI
TL;DR: This work proposes to automatically populate ImageNet with pixelwise object-background segmentations, by leveraging existing manual annotations in the form of class labels and bounding-boxes, to recursively exploit images segmented so far to guide the segmentation of new images.
Abstract: ImageNet is a large-scale hierarchical database of object classes with millions of images. We propose to automatically populate it with pixelwise object-background segmentations, by leveraging existing manual annotations in the form of class labels and bounding-boxes. The key idea is to recursively exploit images segmented so far to guide the segmentation of new images. At each stage this propagation process expands into the images which are easiest to segment at that point in time, e.g. by moving to the classes semantically most related to those segmented so far. The propagation of segmentation occurs both (a) at the image level, by transferring existing segmentations to estimate the probability of a pixel to be foreground, and (b) at the class level, by jointly segmenting images of the same class and by importing the appearance models of classes that are already segmented. Through experiments on 577 classes and 500k images we show that our technique (i) annotates a wide range of classes with accurate segmentations; (ii) effectively exploits the hierarchical structure of ImageNet; (iii) scales efficiently, especially when implemented on superpixels; (iv) outperforms a baseline GrabCut (Rother et al. 2004) initialized on the image center, as well as segmentation transfer from a fixed source pool and run independently on each target image (Kuettel and Ferrari 2012). Moreover, our method also delivers state-of-the-art results on the recent iCoseg dataset for co-segmentation.

Journal ArticleDOI
TL;DR: The techniques are demonstrated on joint stereo and object labelling problems, as well as object class segmentation, showing in addition how, for joint object-stereo labelling, the method provides an efficient approach to inference in product label-spaces.
Abstract: Recently, a number of cross bilateral filtering methods have been proposed for solving multi-label problems in computer vision, such as stereo, optical flow and object class segmentation that show an order of magnitude improvement in speed over previous methods. These methods have achieved good results despite using models with only unary and/or pairwise terms. However, previous work has shown the value of using models with higher-order terms e.g. to represent label consistency over large regions, or global co-occurrence relations. We show how these higher-order terms can be formulated such that filter-based inference remains possible. We demonstrate our techniques on joint stereo and object labelling problems, as well as object class segmentation, showing in addition for joint object-stereo labelling how our method provides an efficient approach to inference in product label-spaces. We show that we are able to speed up inference in these models around 10 to 30 times with respect to competing graph-cut/move-making methods, while maintaining or improving accuracy in all cases. We show results on PascalVOC-10 for object class segmentation, and Leuven for joint object-stereo labelling.
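
To make "filter-based inference" concrete, here is a schematic pairwise-only mean-field step in which message passing is a per-label blur of the current marginals. A Gaussian filter stands in for the cross bilateral filter, and the paper's contribution, extending such updates to higher-order terms, is not shown:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def meanfield_step(Q, unary, compat, sigma=3.0):
    """Q, unary: (H, W, L) marginals and unary costs; compat: (L, L)."""
    msg = np.stack([gaussian_filter(Q[..., l], sigma)
                    for l in range(Q.shape[-1])], axis=-1)
    logits = -unary - msg @ compat            # compatibility transform
    logits -= logits.max(axis=-1, keepdims=True)
    Q_new = np.exp(logits)                    # normalize per pixel
    return Q_new / Q_new.sum(axis=-1, keepdims=True)
```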

Journal ArticleDOI
TL;DR: This paper introduces a spectral divisive clustering algorithm to efficiently extract a hierarchy over a large number of tracklets and provides an efficient positive definite kernel that computes the structural and visual similarity of two hierarchical decompositions by relying on models of their parent–child relations.
Abstract: Complex activities, e.g. pole vaulting, are composed of a variable number of sub-events connected by complex spatio-temporal relations, whereas simple actions can be represented as sequences of short temporal parts. In this paper, we learn hierarchical representations of activity videos in an unsupervised manner. These hierarchies of mid-level motion components are data-driven decompositions specific to each video. We introduce a spectral divisive clustering algorithm to efficiently extract a hierarchy over a large number of tracklets (i.e. local trajectories). We use this structure to represent a video as an unordered binary tree. We model this tree using nested histograms of local motion features. We provide an efficient positive definite kernel that computes the structural and visual similarity of two hierarchical decompositions by relying on models of their parent–child relations. We present experimental results on four recent challenging benchmarks: the High Five dataset (Patron-Perez et al., High five: recognising human interactions in TV shows, 2010), the Olympics Sports dataset (Niebles et al., Modeling temporal structure of decomposable motion segments for activity classification, 2010), the Hollywood 2 dataset (Marszalek et al., Actions in context, 2009), and the HMDB dataset (Kuehne et al., HMDB: A large video database for human motion recognition, 2011). We show that per-video hierarchies provide additional information for activity recognition. Our approach improves over unstructured activity models, baselines using other motion decomposition algorithms, and the state of the art.

Journal ArticleDOI
TL;DR: A unified flexible model for both supervised and semi-supervised learning that allows transformations between domains to be learned, together with a joint learning framework for adaptive classifiers that outperforms competing methods in terms of multi-class accuracy and scalability.
Abstract: We address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce a unified flexible model for both supervised and semi-supervised learning that allows us to learn transformations between domains. Additionally, we present two instantiations of the model, one for general feature adaptation/alignment, and one specifically designed for classification. First, we show how to extend metric learning methods for domain adaptation, allowing for learning metrics independent of the domain shift and the final classifier used. Furthermore, we go beyond classical metric learning by extending the method to asymmetric, category independent transformations. Our framework can adapt features even when the target domain does not have any labeled examples for some categories, and when the target and source features have different dimensions. Finally, we develop a joint learning framework for adaptive classifiers, which outperforms competing methods in terms of multi-class accuracy and scalability. We demonstrate the ability of our approach to adapt object recognition models under a variety of situations, such as differing imaging conditions, feature types, and codebooks. The experiments show its strong performance compared to previous approaches and its applicability to large-scale scenarios.

Journal ArticleDOI
TL;DR: Infinitesimal Plane-based Pose Estimation (IPPE), so named because one can think of it as solving pose using the transform about an infinitesimally small region on the surface, is presented; it is extremely fast and allows the method to be fully characterised in terms of degeneracies, number of returned solutions, and the geometric relationship of these solutions.
Abstract: Estimating the pose of a plane given a set of point correspondences is a core problem in computer vision with many applications including Augmented Reality (AR), camera calibration and 3D scene reconstruction and interpretation. Despite much progress over recent years there is still the need for a more efficient and more accurate solution, particularly in mobile applications where the run-time budget is critical. We present a new analytic solution to the problem which is far faster than current methods based on solving Pose from n Points (PnP) and is in most cases more accurate. Our approach involves a new way to exploit redundancy in the homography coefficients. This uses the fact that when the homography is noisy it will estimate the true transform between the model plane and the image better at some regions on the plane than at others. Our method is based on locating a point where the transform is best estimated, and using only the local transformation at that point to constrain pose. This involves solving pose with a local non-redundant 1st-order PDE. We call this framework Infinitesimal Plane-based Pose Estimation (IPPE), because one can think of it as solving pose using the transform about an infinitesimally small region on the surface. We show experimentally that IPPE leads to very accurate pose estimates. Because IPPE is analytic it is both extremely fast and allows us to fully characterise the method in terms of degeneracies, number of returned solutions, and the geometric relationship of these solutions. This characterisation is not possible with state-of-the-art PnP methods.
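
Recent OpenCV releases expose an IPPE-based planar solver, which gives a quick way to try the method. The calibration matrix and correspondences below are made up; IPPE requires coplanar object points, here on z = 0:

```python
import cv2
import numpy as np

obj = np.array([[-1, -1, 0], [1, -1, 0], [1, 1, 0], [-1, 1, 0]],
               dtype=np.float64)             # square model plane (z = 0)
img = np.array([[100, 120], [300, 110], [310, 320], [90, 330]],
               dtype=np.float64)             # detected corners (pixels)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

ok, rvec, tvec = cv2.solvePnP(obj, img, K, None, flags=cv2.SOLVEPNP_IPPE)
```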

Journal ArticleDOI
TL;DR: A new framework for capturing large and complex deformations in image registration and atlas construction is presented; it introduces a new direct feature matching technique that finds global correspondences between images via simple nearest-neighbor searches.
Abstract: This paper presents a new framework for capturing large and complex deformations in image registration and atlas construction. This challenging and recurrent problem in computer vision and medical imaging currently relies on iterative and local approaches, which are prone to local minima and, therefore, limit present methods to relatively small deformations. To address this, our general framework introduces a new direct feature matching technique that finds global correspondences between images via simple nearest-neighbor searches. More specifically, very large image deformations are captured in Spectral Forces, which are derived from an improved graph spectral representation. We illustrate the benefits of our framework through a new enhanced version of the popular Log-Demons algorithm, named the Spectral Log-Demons, as well as through a groupwise extension, named the Groupwise Spectral Log-Demons, which is relevant for atlas construction. The evaluations of these extended versions demonstrate substantial improvements in accuracy and robustness to large deformations over the conventional Demons approaches.

Journal ArticleDOI
TL;DR: The online CRF approach is more powerful at distinguishing spatially close targets with similar appearances, as well as at tracking targets in the presence of camera motion, and an efficient algorithm is introduced for finding an association with low energy cost.
Abstract: We introduce an online learning approach for multi-target tracking. Detection responses are gradually associated into tracklets in multiple levels to produce final tracks. Unlike most previous approaches which only focus on producing discriminative motion and appearance models for all targets, we further consider discriminative features for distinguishing difficult pairs of targets. The tracking problem is formulated using an online learned CRF model, and is transformed into an energy minimization problem. The energy functions include a set of unary functions that are based on motion and appearance models for discriminating all targets, as well as a set of pairwise functions that are based on models for differentiating corresponding pairs of tracklets. The online CRF approach is more powerful at distinguishing spatially close targets with similar appearances, as well as at tracking targets in the presence of camera motion. An efficient algorithm is introduced for finding an association with low energy cost. We present results on four public data sets, and show significant improvements compared with several state-of-the-art methods.

Journal ArticleDOI
TL;DR: This paper addresses the challenging scenario of unsupervised domain adaptation, where the target domain does not provide any annotated data to assist in adapting the classifier.
Abstract: Domain adaptation aims to correct the mismatch in statistical properties between the source domain on which a classifier is trained and the target domain to which the classifier is to be applied. In this paper, we address the challenging scenario of unsupervised domain adaptation, where the target domain does not provide any annotated data to assist in adapting the classifier. Our strategy is to learn robust features which are resilient to the mismatch across domains and then use them to construct classifiers that will perform well on the target domain. To this end, we propose novel kernel learning approaches to infer such features for adaptation. Concretely, we explore two closely related directions. In the first direction, we propose unsupervised learning of a geodesic flow kernel (GFK). The GFK summarizes the inner products in an infinite sequence of feature subspaces that smoothly interpolates between the source and target domains. In the second direction, we propose supervised learning of a kernel that discriminatively combines multiple base GFKs. Those base kernels model the source and the target domains at fine-grained granularities. In particular, each base kernel pivots on a different set of landmarks--the most useful data instances that reveal the similarity between the source and the target domains, thus bridging them to achieve adaptation. Our approaches are computationally convenient, automatically infer important hyper-parameters, and are capable of learning features and classifiers discriminatively without demanding labeled data from the target domain. In extensive empirical studies on standard benchmark recognition datasets, our approaches yield state-of-the-art results compared to a variety of competing methods.
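
For reference, the GFK admits a short closed-form implementation. A hedged sketch following the standard construction via principal angles (orthonormal bases assumed, minimal numerical guards, sign conventions per the usual CS-decomposition treatment; not the authors' released code):

```python
import numpy as np
from scipy.linalg import null_space

def gfk(Ps, Pt, eps=1e-12):
    """Ps, Pt: (D, d) orthonormal subspace bases.
    Returns G (D, D) so that x.T @ G @ y is the GFK inner product."""
    U1, cos_t, Vt = np.linalg.svd(Ps.T @ Pt)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))   # principal angles
    Rs = null_space(Ps.T)                          # complement of Ps
    sin_t = np.maximum(np.sin(theta), eps)
    U2 = -Rs.T @ Pt @ Vt.T / sin_t                 # Rs^T Pt = -U2 diag(sin) V^T

    t = np.maximum(theta, eps)
    l1 = 0.5 + np.sin(2 * t) / (4 * t)             # integral of cos^2
    l2 = (np.cos(2 * t) - 1) / (4 * t)             # cross term
    l3 = 0.5 - np.sin(2 * t) / (4 * t)             # integral of sin^2

    A = np.hstack([Ps @ U1, Rs @ U2])
    L = np.block([[np.diag(l1), np.diag(l2)],
                  [np.diag(l2), np.diag(l3)]])
    return A @ L @ A.T

# Example: Ps, _ = np.linalg.qr(np.random.randn(64, 10))
#          Pt, _ = np.linalg.qr(np.random.randn(64, 10));  G = gfk(Ps, Pt)
```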