Showing papers in "International Journal of Computer Vision in 2015"


Journal ArticleDOI
TL;DR: The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object category classification and detection on hundreds of object categories and millions of images; it has been run annually from 2010 to present, attracting participation from more than fifty institutions.
Abstract: The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
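The human-vs-machine accuracy comparison in the paper is framed in terms of ILSVRC's standard classification metric, top-5 error: a prediction counts as correct if the ground-truth label appears among the five highest-scoring classes. A minimal sketch of the metric, with hypothetical score and label arrays:

import numpy as np

def top5_error(scores, labels):
    # scores: (n_images, n_classes); labels: (n_images,) true class ids
    top5 = np.argsort(scores, axis=1)[:, -5:]      # 5 highest-scoring classes
    hits = (top5 == labels[:, None]).any(axis=1)   # true label among them?
    return 1.0 - hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((4, 10))                       # toy: 4 images, 10 classes
labels = np.array([3, 7, 0, 9])
print(top5_error(scores, labels))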

30,811 citations


Journal ArticleDOI
TL;DR: A review of the Pascal Visual Object Classes challenge from 2008-2012 and an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
Abstract: The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008 to 2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community's progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
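Average precision underlies most of these comparisons, and the paper's normalised variant rescales it so that classes with different proportions of positive instances become comparable. As a reference point, a minimal sketch of plain (unnormalised) AP over a ranked list, computed with the monotone precision envelope:

import numpy as np

def average_precision(scores, is_positive, n_positives):
    # scores: detector confidences; is_positive: boolean ground truth
    order = np.argsort(-scores)
    tp = np.cumsum(is_positive[order])
    fp = np.cumsum(~is_positive[order])
    recall = tp / n_positives
    precision = tp / (tp + fp)
    # envelope: make precision monotonically non-increasing
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    r = np.concatenate(([0.0], recall))
    return np.sum((r[1:] - r[:-1]) * precision)    # area under the PR curve

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
is_positive = np.array([True, False, True, True, False])
print(average_precision(scores, is_positive, n_positives=3))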

6,061 citations


Journal ArticleDOI
TL;DR: A novel approach for converting a deep CNN into a SNN that enables mapping CNN to spike-based hardware architectures and evaluates the resulting SNN on publicly available Defense Advanced Research Projects Agency (DARPA) Neovision2 Tower and CIFAR-10 datasets and shows similar object recognition accuracy as the original CNN.
Abstract: Deep-learning neural networks such as convolutional neural network (CNN) have shown great potential as a solution for difficult vision problems, such as object recognition. Spiking neural networks (SNN)-based architectures have shown great potential as a solution for realizing ultra-low power consumption using spike-based neuromorphic hardware. This work describes a novel approach for converting a deep CNN into a SNN that enables mapping CNN to spike-based hardware architectures. Our approach first tailors the CNN architecture to fit the requirements of SNN, then trains the tailored CNN in the same way as one would with CNN, and finally applies the learned network weights to an SNN architecture derived from the tailored CNN. We evaluate the resulting SNN on publicly available Defense Advanced Research Projects Agency (DARPA) Neovision2 Tower and CIFAR-10 datasets and show similar object recognition accuracy as the original CNN. Our SNN implementation is amenable to direct mapping to spike-based neuromorphic hardware, such as the ones being developed under the DARPA SyNAPSE program. Our hardware mapping analysis suggests that SNN implementation on such spike-based hardware is two orders of magnitude more energy-efficient than the original CNN implementation on off-the-shelf FPGA-based hardware.
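The conversion hinges on rate coding: a non-leaky integrate-and-fire neuron driven by a constant input fires at a rate approximating a ReLU of that input (saturating at one spike per timestep, which is why such conversions typically rescale weights so activations stay in range). A toy illustration of that correspondence, not the paper's full pipeline:

import numpy as np

def if_firing_rate(x, t_steps=1000, v_thresh=1.0):
    # reset-by-subtraction integrate-and-fire neuron with constant input x
    v, spikes = 0.0, 0
    for _ in range(t_steps):
        v += x                    # integrate the input current
        if v >= v_thresh:         # fire and subtract the threshold
            spikes += 1
            v -= v_thresh
    return spikes / t_steps       # spikes per timestep

for x in [-0.5, 0.0, 0.3, 0.7]:
    print(x, if_firing_rate(x))   # rate tracks max(x, 0) for x in [0, 1]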

695 citations


Journal ArticleDOI
TL;DR: This paper proposes a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition that exploits the whole active pool to evaluate the uncertainty of the data, and proposes to make the selected data as diverse as possible.
Abstract: As a way to relieve the tedious work of manual annotation, active learning plays an important role in many applications of visual concept recognition. In typical active learning scenarios, the number of labelled data in the seed set is usually small. However, most existing active learning algorithms only exploit the labelled data, which often leads to over-fitting due to the small number of labelled examples. Besides, while much progress has been made in binary class active learning, little research attention has been focused on multi-class active learning. In this paper, we propose a semi-supervised batch mode multi-class active learning algorithm for visual concept recognition. Our algorithm exploits the whole active pool to evaluate the uncertainty of the data. Considering that uncertain data are always similar to each other, we propose to make the selected data as diverse as possible, for which we explicitly impose a diversity constraint on the objective function. As a multi-class active learning algorithm, our algorithm is able to exploit uncertainty across multiple classes. An efficient algorithm is used to optimize the objective function. Extensive experiments on action recognition, object classification, scene recognition, and event detection demonstrate its advantages.
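The selection principle, preferring uncertain examples while keeping the batch diverse, can be caricatured as a greedy procedure; the paper instead optimises a joint objective with an explicit diversity constraint, so treat the following as an illustrative simplification over hypothetical inputs:

import numpy as np

def select_batch(probs, features, k, lam=0.5):
    # probs: (n, c) class posteriors; features: (n, d), L2-normalised
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(len(probs)):
            if i in selected:
                continue
            # penalise similarity to anything already in the batch
            sim = max((features[i] @ features[j] for j in selected), default=0.0)
            score = entropy[i] - lam * sim     # uncertain but diverse
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected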

401 citations


Journal ArticleDOI
TL;DR: This work provides a probabilistic model of the target variations over time and rigorously shows that this model is a special case of the Earth Mover’s Distance optimization problem where the ground distance is governed by some underlying noise model.
Abstract: Locally Orderless Tracking (LOT) is a visual tracking algorithm that automatically estimates the amount of local (dis)order in the target. This lets the tracker specialize in both rigid and deformable objects on-line and with no prior assumptions. We provide a probabilistic model of the target variations over time. We then rigorously show that this model is a special case of the Earth Mover's Distance optimization problem where the ground distance is governed by some underlying noise model. This noise model has several parameters that control the cost of moving pixels and changing their color. We develop two such noise models and demonstrate how their parameters can be estimated on-line during tracking to account for the amount of local (dis)order in the target. We also discuss the significance of this on-line parameter update and demonstrate its contribution to the performance. Finally, we show LOT's tracking capabilities on challenging video sequences, both commonly used and new, displaying performance comparable to state-of-the-art methods.
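At the core is the Earth Mover's Distance between target signatures, with a ground distance shaped by the learned noise model over pixel location and color. For one-dimensional distributions the EMD has a closed form that SciPy exposes directly; a minimal sketch comparing two toy patches by their pixel values only:

import numpy as np
from scipy.stats import wasserstein_distance

# two grayscale patches summarised by their pixel-value samples
patch_a = np.array([10, 12, 11, 200, 198], dtype=float)
patch_b = np.array([12, 14, 13, 190, 205], dtype=float)

# 1-D Earth Mover's Distance between the empirical distributions; LOT's
# ground distance additionally covers pixel location and noise parameters
print(wasserstein_distance(patch_a, patch_b))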

383 citations


Journal ArticleDOI
TL;DR: A novel superpixelwise convolutional neural network approach, called SuperCNN, is proposed to learn the internal representations of saliency in an efficient manner, and can robustly detect salient objects and outperforms the state-of-the-art methods on three benchmark datasets.
Abstract: Existing computational models for salient object detection primarily rely on hand-crafted features, which are only able to capture low-level contrast information. In this paper, we learn the hierarchical contrast features by formulating salient object detection as a binary labeling problem using deep learning techniques. A novel superpixelwise convolutional neural network approach, called SuperCNN, is proposed to learn the internal representations of saliency in an efficient manner. In contrast to the classical convolutional networks, SuperCNN has four main properties. First, the proposed method is able to learn the hierarchical contrast features, as it is fed by two meaningful superpixel sequences, which is much more effective for detecting salient regions than feeding raw image pixels. Second, as SuperCNN recovers the contextual information among superpixels, it enables large context to be involved in the analysis efficiently. Third, benefiting from the superpixelwise mechanism, the required number of predictions for a densely labeled map is hugely reduced. Fourth, saliency can be detected independent of region size by utilizing a multiscale network structure. Experiments show that SuperCNN can robustly detect salient objects and outperforms the state-of-the-art methods on three benchmark datasets.

263 citations


Journal ArticleDOI
TL;DR: This paper proposes a consistent low-rank sparse tracker (CLRST), built upon the particle filter framework, that adaptively prunes and selects candidate particles by using linear sparse combinations of dictionary templates.
Abstract: Object tracking is the process of determining the states of a target in consecutive video frames based on properties of motion and appearance consistency. In this paper, we propose a consistent low-rank sparse tracker (CLRST) that builds upon the particle filter framework for tracking. By exploiting temporal consistency, the proposed CLRST algorithm adaptively prunes and selects candidate particles. By using linear sparse combinations of dictionary templates, the proposed method learns the sparse representations of image regions corresponding to candidate particles jointly by exploiting the underlying low-rank constraints. In addition, the proposed CLRST algorithm is computationally attractive since the temporal consistency property helps prune particles and the low-rank minimization problem for learning joint sparse representations can be efficiently solved by a sequence of closed form update operations. We evaluate the proposed CLRST algorithm against 14 state-of-the-art tracking methods on a set of 25 challenging image sequences. Experimental results show that the CLRST algorithm performs favorably against state-of-the-art tracking methods in terms of accuracy and execution time.
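The particle filter backbone is standard: propagate candidate states with a motion model, weight them by how well their image regions match the target model, and resample. A generic sketch, with a toy Gaussian likelihood standing in for CLRST's low-rank sparse reconstruction score:

import numpy as np

def particle_filter_step(particles, likelihood, motion_std=2.0):
    # predict: random-walk motion model over 2-D positions
    particles = particles + np.random.normal(0, motion_std, particles.shape)
    # weight: image-evidence likelihood of each candidate state
    w = np.array([likelihood(p) for p in particles])
    w /= w.sum()
    # resample proportionally to the weights
    idx = np.random.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

true_pos = np.array([50.0, 40.0])                  # unknown to the tracker
lik = lambda p: np.exp(-np.sum((p - true_pos) ** 2) / (2 * 5.0 ** 2))
parts = np.random.uniform(0, 100, (200, 2))
for _ in range(10):
    parts = particle_filter_step(parts, lik)
print(parts.mean(axis=0))                          # settles near true_pos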

255 citations


Journal ArticleDOI
TL;DR: This paper addresses the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data, and proposes an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset.
Abstract: In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset (Silberman et al., ECCV, 2012). We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach of Arbelaez et al. (TPAMI, 2011) by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by analyzing and computing histograms of oriented gradients on the depth image and using them with deformable part models (Felzenszwalb et al., TPAMI, 2010). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation, for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improve semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state-of-the-art.
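The detector-training strategy amounts to running a standard HOG pipeline on the depth channel as if it were intensity. A minimal sketch using scikit-image, with a random array standing in for a real depth map:

import numpy as np
from skimage.feature import hog

depth = np.random.rand(128, 64)        # stand-in for a real depth image
feat = hog(depth, orientations=9, pixels_per_cell=(8, 8),
           cells_per_block=(2, 2))     # same descriptor, depth instead of RGB
print(feat.shape)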

253 citations


Journal ArticleDOI
TL;DR: This work proposes to represent the dynamic scene as a collection of rigidly moving planes, into which the input images are segmented, and shows that a view-consistent multi-frame scheme significantly improves accuracy, especially in the presence of occlusions, and increases robustness against adverse imaging conditions.
Abstract: 3D scene flow estimation aims to jointly recover dense geometry and 3D motion from stereoscopic image sequences, thus generalizing classical disparity and 2D optical flow estimation. To realize its conceptual benefits and overcome limitations of many existing methods, we propose to represent the dynamic scene as a collection of rigidly moving planes, into which the input images are segmented. Geometry and 3D motion are then jointly recovered alongside an over-segmentation of the scene. This piecewise rigid scene model is significantly more parsimonious than conventional pixel-based representations, yet retains the ability to represent real-world scenes with independent object motion. It, furthermore, enables us to define suitable scene priors, perform occlusion reasoning, and leverage discrete optimization schemes toward stable and accurate results. Assuming the rigid motion to persist approximately over time additionally enables us to incorporate multiple frames into the inference. To that end, each view holds its own representation, which is encouraged to be consistent across all other viewpoints and frames in a temporal window. We show that such a view-consistent multi-frame scheme significantly improves accuracy, especially in the presence of occlusions, and increases robustness against adverse imaging conditions. Our method currently achieves leading performance on the KITTI benchmark, for both flow and stereo.

244 citations


Journal ArticleDOI
TL;DR: In this article, the authors present an empirical comparison of 27 state-of-the-art optimization techniques on a corpus of 2453 energy minimization instances from diverse applications in computer vision.
Abstract: Szeliski et al. published an influential study in 2006 on energy minimization methods for Markov random fields. This study provided valuable insights in choosing the best optimization technique for certain classes of problems. While these insights remain generally useful today, the phenomenal success of random field models means that the kinds of inference problems that have to be solved changed significantly. Specifically, the models today often include higher order interactions, flexible connectivity structures, large label-spaces of different cardinalities, or learned energy tables. To reflect these changes, we provide a modernized and enlarged study. We present an empirical comparison of more than 27 state-of-the-art optimization techniques on a corpus of 2453 energy minimization instances from diverse applications in computer vision. To ensure reproducibility, we evaluate all methods in the OpenGM 2 framework and report extensive results regarding runtime and solution quality. Key insights from our study agree with the results of Szeliski et al. for the types of models they studied. However, on new and challenging types of models our findings disagree and suggest that polyhedral methods and integer programming solvers are competitive in terms of runtime and solution quality over a large range of model types.
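A minimal example of the kind of instance being benchmarked: a pairwise Potts model on a 4-connected grid, minimised here with iterated conditional modes (ICM), one of the simplest baselines among the solvers compared:

import numpy as np

def icm_potts(unary, beta=1.0, iters=5):
    # energy: sum_i unary[i, x_i] + beta * sum over edges of [x_i != x_j]
    h, w, k = unary.shape
    x = unary.argmin(axis=2)                      # unary-only initialisation
    for _ in range(iters):
        for i in range(h):
            for j in range(w):
                costs = unary[i, j].copy()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        costs += beta * (np.arange(k) != x[ni, nj])
                x[i, j] = costs.argmin()          # greedy local update
    return x

unary = np.random.rand(20, 20, 3)                 # toy 3-label problem
print(icm_potts(unary).shape)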

218 citations


Journal ArticleDOI
TL;DR: A nonlocal extension of the Gaussian scale mixture (GSM) model is developed using simultaneous sparse coding (SSC) and its applications to image restoration are explored; it is shown that the variances of sparse coefficients can be jointly estimated along with the unknown sparse coefficients via the method of alternating optimization.
Abstract: In image processing, sparse coding has been known to be relevant to both variational and Bayesian approaches. The regularization parameter in variational image restoration is intrinsically connected with the shape parameter of sparse coefficients' distribution in Bayesian methods. How to set those parameters in a principled yet spatially adaptive fashion turns out to be a challenging problem, especially for the class of nonlocal image models. In this work, we propose a structured sparse coding framework to address this issue: more specifically, a nonlocal extension of the Gaussian scale mixture (GSM) model is developed using simultaneous sparse coding (SSC) and its applications to image restoration are explored. It is shown that the variances of sparse coefficients (the field of scalar multipliers of Gaussians), if treated as a latent variable, can be jointly estimated along with the unknown sparse coefficients via the method of alternating optimization. When applied to image restoration, our experimental results have shown that the proposed SSC-GSM technique can both preserve the sharpness of edges and suppress undesirable artifacts. Thanks to its capability of achieving a better spatial adaptation, SSC-GSM based image restoration often delivers reconstructed images with higher subjective/objective qualities than other competing approaches.
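The alternating idea can be caricatured on a single group of similar patches: estimate the variance field from the current coefficient estimates, apply the shrinkage those variances imply, and repeat. A deliberately simplified toy, not the paper's exact update, which is derived from the full SSC-GSM objective:

import numpy as np

def gsm_alternating(Y, sigma_n, iters=5):
    # Y: (n_coeffs, n_patches) transform coefficients of similar patches
    X = Y.copy()
    for _ in range(iters):
        # variance step: estimate per-row variances jointly across the group
        theta2 = np.maximum(np.mean(X ** 2, axis=1, keepdims=True), 1e-12)
        # coefficient step: the Wiener-style shrinkage those variances imply
        X = (theta2 / (theta2 + sigma_n ** 2)) * Y
    return X

Y = np.random.randn(16, 8) * np.array([5.0] * 4 + [0.1] * 12)[:, None]
print(np.round(gsm_alternating(Y, sigma_n=1.0), 2)[:2])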

Journal ArticleDOI
TL;DR: The promising performance of the proposed approach in image denoising is shown, comparing quite favorably with approaches involving a single learned square transform, an overcomplete synthesis dictionary, or Gaussian mixture models.
Abstract: In recent years, sparse signal modeling, especially using the synthesis model, has been popular. Sparse coding in the synthesis model is, however, NP-hard. Recently, interest has turned to the sparsifying transform model, for which sparse coding is cheap. However, natural images typically contain diverse textures that cannot be sparsified well by a single transform. Hence, in this work, we propose a union of sparsifying transforms model. Sparse coding in this model reduces to a form of clustering. The proposed model is also equivalent to a structured overcomplete sparsifying transform model with block cosparsity, dubbed OCTOBOS. The alternating algorithm introduced for learning such transforms involves simple closed-form solutions. A theoretical analysis provides a convergence guarantee for this algorithm. It is shown to be globally convergent to the set of partial minimizers of the non-convex learning problem. We also show that under certain conditions, the algorithm converges to the set of stationary points of the overall objective. When applied to images, the algorithm learns a collection of well-conditioned square transforms, and a good clustering of patches or textures. The resulting sparse representations for the images are much better than those obtained with a single learned transform, or with analytical transforms. We show the promising performance of the proposed approach in image denoising, which compares quite favorably with approaches involving a single learned square transform or an overcomplete synthesis dictionary, or Gaussian mixture models. The proposed denoising method is also faster than the synthesis dictionary based approach.
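Why sparse coding is cheap in the transform model, and how the union of transforms turns it into clustering: for each candidate square transform W, the best s-sparse code of W @ x is obtained exactly by keeping the s largest-magnitude entries, and the patch is assigned to the transform with the lowest sparsification residual. A minimal sketch:

import numpy as np

def octobos_assign(x, transforms, s):
    # returns (cluster index, sparse code) for one vectorised patch x
    best = None
    for k, W in enumerate(transforms):
        z = W @ x
        code = np.zeros_like(z)
        keep = np.argsort(-np.abs(z))[:s]
        code[keep] = z[keep]                   # exact hard thresholding
        err = np.linalg.norm(z - code) ** 2    # sparsification residual
        if best is None or err < best[0]:
            best = (err, k, code)
    return best[1], best[2]

transforms = [np.linalg.qr(np.random.randn(64, 64))[0] for _ in range(2)]
idx, code = octobos_assign(np.random.randn(64), transforms, s=8)
print(idx, np.count_nonzero(code))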

Journal ArticleDOI
TL;DR: From a variety of experiments that are performed on output videos, it is shown that the proposed technique performs better than state-of-the-art techniques.
Abstract: In the context of extracting information from video, bad weather conditions like rain can have a detrimental effect. In this paper, a novel framework to detect and remove rain streaks from video is proposed. The first part of the proposed framework for rain removal is a technique to detect rain streaks based on phase congruency features. The variation of features from frame to frame is used to estimate the candidate rain pixels in a frame. In order to reduce the number of false candidates due to global motion, frames are registered using phase correlation. The second part of the proposed framework is a novel reconstruction technique that utilizes information from three different sources, which are intensities of the rain affected pixel, spatial neighbors, and temporal neighbors. An optimal estimate for the actual intensity of the rain affected pixel is made based on the minimization of registration error between frames. An optical flow technique using local phase information is adopted for registration. This part of the proposed framework for removing rain is modeled such that the presence of local motion will not distort the features in the reconstructed video. The proposed framework is evaluated quantitatively and qualitatively on a variety of videos with varying complexities. The effectiveness of the algorithm is quantitatively verified by computing a no-reference image quality measure on individual frames of the reconstructed video. From a variety of experiments that are performed on output videos, it is shown that the proposed technique performs better than state-of-the-art techniques.
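Frame registration by phase correlation, used here to suppress false rain candidates caused by global motion, reduces to locating the peak of the normalised cross-power spectrum. A minimal sketch for integer translations:

import numpy as np

def phase_correlation(a, b):
    # returns the (dy, dx) shift taking frame b to frame a
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = Fa * np.conj(Fb)
    r = np.fft.ifft2(cross / (np.abs(cross) + 1e-12)).real
    dy, dx = np.unravel_index(np.argmax(r), r.shape)
    # map peaks past the midpoint to negative shifts
    if dy > a.shape[0] // 2: dy -= a.shape[0]
    if dx > a.shape[1] // 2: dx -= a.shape[1]
    return dy, dx

img = np.random.rand(64, 64)
shifted = np.roll(np.roll(img, 3, axis=0), -5, axis=1)
print(phase_correlation(shifted, img))    # -> (3, -5)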

Journal ArticleDOI
TL;DR: A robust and fast-to-evaluate energy function is defined, based on enforcing color similarity between the boundaries and the superpixel color histogram; the resulting method achieves performance comparable to the state-of-the-art, but in real-time on a single Intel i7 CPU at 2.8 GHz.
Abstract: Superpixel algorithms aim to over-segment the image by grouping pixels that belong to the same object. Many state-of-the-art superpixel algorithms rely on minimizing objective functions to enforce color homogeneity. The optimization is accomplished by sophisticated methods that progressively build the superpixels, typically by adding cuts or growing superpixels. As a result, they are computationally too expensive for real-time applications. We introduce a new approach based on a simple hill-climbing optimization. Starting from an initial superpixel partitioning, it continuously refines the superpixels by modifying the boundaries. We define a robust and fast-to-evaluate energy function, based on enforcing color similarity between the boundaries and the superpixel color histogram. In a series of experiments, we show that we achieve an excellent compromise between accuracy and efficiency. We are able to achieve a performance comparable to the state-of-the-art, but in real-time on a single Intel i7 CPU at 2.8 GHz.

Journal ArticleDOI
TL;DR: In this paper, the authors developed region cues indicative of high-level saliency in egocentric video and learned a regressor to predict the relative importance of any new region based on these cues.
Abstract: We present a video summarization approach for egocentric or "wearable" camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer's day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video--such as the nearness to hands, gaze, and frequency of occurrence--and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method's promise relative to existing techniques for saliency and summarization.
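For the length-budget variant, a dynamic program over (frames picked so far, index of the last pick) trades importance against redundancy between consecutive selections. A simplified sketch assuming L2-normalised per-frame features; the paper's objective also scores temporal displacement:

import numpy as np

def storyboard_dp(importance, features, k, alpha=1.0):
    # best[j][i]: best score using j frames with the last one at index i
    n = len(importance)
    best = np.full((k + 1, n), -1e18)
    back = np.zeros((k + 1, n), dtype=int)
    best[1] = importance
    for j in range(2, k + 1):
        for i in range(n):
            for p in range(i):
                sim = features[i] @ features[p]       # visual redundancy
                s = best[j - 1][p] + importance[i] - alpha * sim
                if s > best[j][i]:
                    best[j][i], back[j][i] = s, p
    i = int(np.argmax(best[k]))                        # best final frame
    picks = [i]
    for j in range(k, 1, -1):
        i = int(back[j][i])
        picks.append(i)
    return picks[::-1]

feats = np.random.randn(30, 8)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
print(storyboard_dp(np.random.rand(30), feats, k=5))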

Journal ArticleDOI
TL;DR: This paper significantly extends the SIFT-like matching framework to mesh data and proposes a novel approach using fine-grained matching of 3D keypoint descriptors, which accounts for the average reconstruction error of probe face descriptors sparsely represented by a large dictionary of gallery descriptors in identification.
Abstract: Registration algorithms performed on point clouds or range images of face scans have been successfully used for automatic 3D face recognition under expression variations, but have rarely been investigated to solve pose changes and occlusions, mainly because the basic landmarks to initialize coarse alignment are not always available. Recently, local feature-based SIFT-like matching has proven competent to handle all such variations without registration. In this paper, towards 3D face recognition for real-life biometric applications, we significantly extend the SIFT-like matching framework to mesh data and propose a novel approach using fine-grained matching of 3D keypoint descriptors. First, two principal curvature-based 3D keypoint detectors are provided, which can repeatedly identify complementary locations on a face scan where local curvatures are high. Then, a robust 3D local coordinate system is built at each keypoint, which allows extraction of pose-invariant features. Three keypoint descriptors, corresponding to three surface differential quantities, are designed, and their feature-level fusion is employed to comprehensively describe local shapes of detected keypoints. Finally, we propose a multi-task sparse representation based fine-grained matching algorithm, which accounts for the average reconstruction error of probe face descriptors sparsely represented by a large dictionary of gallery descriptors in identification. Our approach is evaluated on the Bosphorus database and achieves rank-one recognition rates of 96.56%, 98.82%, 91.14%, and 99.21% on the entire database and on the expression, pose, and occlusion subsets, respectively. To the best of our knowledge, these are the best results reported so far on this database. Additionally, good generalization ability is also exhibited by the experiments on the FRGC v2.0 database.

Journal ArticleDOI
TL;DR: A heterogeneous multi-task learning framework for human pose estimation from monocular images using a deep convolutional neural network and it is shown that including the detection tasks helps to regularize the network, directing it to converge to a good solution.
Abstract: We propose a heterogeneous multi-task learning framework for human pose estimation from monocular images using a deep convolutional neural network. In particular, we simultaneously learn a human pose regressor and sliding-window body-part and joint-point detectors in a deep network architecture. We show that including the detection tasks helps to regularize the network, directing it to converge to a good solution. We report competitive and state-of-the-art results on several datasets. We also empirically show that the learned neurons in the middle layer of our network are tuned to localized body parts.
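The architectural idea, a shared convolutional trunk feeding a pose-regression head and a body-part detection head trained jointly, takes a few lines of PyTorch; the layer sizes below are placeholders rather than the paper's network:

import torch
import torch.nn as nn

class MultiTaskPoseNet(nn.Module):
    def __init__(self, n_joints=14):
        super().__init__()
        # shared trunk regularised by both tasks
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        self.regressor = nn.Linear(64 * 16, 2 * n_joints)  # (x, y) per joint
        self.detector = nn.Linear(64 * 16, n_joints)       # part-presence logits

    def forward(self, x):
        h = self.trunk(x)
        return self.regressor(h), self.detector(h)

net = MultiTaskPoseNet()
coords, logits = net(torch.randn(2, 3, 128, 128))
loss = nn.functional.mse_loss(coords, torch.randn_like(coords)) \
     + nn.functional.binary_cross_entropy_with_logits(logits, torch.rand(2, 14))
loss.backward()                 # gradients from both tasks shape the trunk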

Journal ArticleDOI
TL;DR: New ways for exploiting the structure of an image database by representing it as a graph are explored, and it is shown how the rich information embedded in such a graph can improve bag-of-words-based location recognition methods.
Abstract: Recognizing the location of a query image by matching it to an image database is an important problem in computer vision, and one for which the representation of the database is a key issue. We explore new ways for exploiting the structure of an image database by representing it as a graph, and show how the rich information embedded in such a graph can improve bag-of-words-based location recognition methods. In particular, starting from a graph based on visual connectivity, we propose a method for selecting a set of overlapping subgraphs and learning a local distance function for each subgraph using discriminative techniques. For a query image, each database image is ranked according to these local distance functions in order to place the image in the right part of the graph. In addition, we propose a probabilistic method for increasing the diversity of these ranked database images, again based on the structure of the image graph. We demonstrate that our methods improve performance over standard bag-of-words methods on several existing location recognition datasets.

Journal ArticleDOI
TL;DR: The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.
Abstract: The standard approach to recognizing text in images consists in first classifying local image regions into candidate characters and then combining them with high-level word models such as conditional random fields. This paper explores a new paradigm that departs from this bottom-up view. We propose to embed word labels and word images into a common Euclidean space. Given a word image to be recognized, the text recognition problem is cast as one of retrieval: find the closest word label in this space. This common space is learned using the Structured SVM framework by enforcing matching label-image pairs to be closer than non-matching pairs. This method presents several advantages: it does not require ad-hoc or costly pre-/post-processing operations, it can build on top of any state-of-the-art image descriptor (Fisher vectors in our case), it allows for the recognition of never-seen-before words (zero-shot recognition) and the recognition process is simple and efficient, as it amounts to a nearest neighbor search. Experiments are performed on challenging datasets of license plates and scene text. The main conclusion of the paper is that with such a frugal approach it is possible to obtain results which are competitive with standard bottom-up approaches, thus establishing label embedding as an interesting and simple to compute baseline for text recognition.
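The recognition step is literally a nearest-neighbor search in the learned common space. A sketch with a crude bag-of-characters label embedding standing in for the paper's label representation, and a random matrix standing in for the learned image-side projection:

import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed_label(word):
    # toy label embedding: normalised character counts
    v = np.array([word.count(c) for c in ALPHABET], dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

def recognize(image_feat, lexicon, W):
    # project the word-image descriptor (e.g. a Fisher vector) into the
    # common space and retrieve the closest label, even a never-seen one
    q = W @ image_feat
    q /= np.linalg.norm(q) + 1e-12
    E = np.stack([embed_label(w) for w in lexicon])
    return lexicon[int(np.argmax(E @ q))]

rng = np.random.default_rng(0)
W = rng.standard_normal((26, 128))     # stand-in for the learned projection
print(recognize(rng.standard_normal(128), ["taxi", "stop", "exit"], W))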

Journal ArticleDOI
TL;DR: In this article, a mixture model of dynamic pedestrian-agents (MDA) is proposed to learn the collective behavior patterns of pedestrians in crowded scenes from video sequences, where each pedestrian in the crowd is driven by a dynamic pedestrianagent, which is a linear dynamic system with initial and termination states reflecting the pedestrian's belief of the starting point and the destination.
Abstract: Collective behaviors characterize the intrinsic dynamics of crowds. Automatically understanding collective crowd behaviors has important applications to video surveillance, traffic management and crowd control, and it is closely related to scientific fields such as statistical physics and biology. In this paper, a new mixture model of dynamic pedestrian-agents (MDA) is proposed to learn the collective behavior patterns of pedestrians in crowded scenes from video sequences. Following agent-based modeling, each pedestrian in the crowd is driven by a dynamic pedestrian-agent, which is a linear dynamic system with initial and termination states reflecting the pedestrian's belief about the starting point and the destination. The whole crowd is then modeled as a mixture of dynamic pedestrian-agents. Once the model parameters are learned from the trajectories extracted from videos, MDA can simulate the crowd behaviors. It can also infer the past behaviors and predict the future behaviors of pedestrians given their partially observed trajectories, and classify them into different pedestrian behaviors. The effectiveness of MDA and its applications are demonstrated by qualitative and quantitative experiments on various video surveillance sequences.

Journal ArticleDOI
TL;DR: The framework described in this paper aims to provide a unified solution for solving ego-motion estimation problems involving high-rate unsynchronized devices by presenting a continuous-time formulation that makes use of cumulative cubic B-Splines parameterized in the Lie Algebra of the group SE3.
Abstract: The use of multiple sensors for ego-motion estimation is an approach often used to provide more accurate and robust results. However, when representing ego-motion as a discrete series of poses, fusing information from unsynchronized sensors is not straightforward. The framework described in this paper aims to provide a unified solution for solving ego-motion estimation problems involving high-rate unsynchronized devices. Instead of a discrete-time pose representation, we present a continuous-time formulation that makes use of cumulative cubic B-splines parameterized in the Lie algebra of the group SE(3). This trajectory representation has several advantages for sensor fusion: (1) it has local control, which enables sliding window implementations; (2) it is C^2 continuous, allowing predictions of inertial measurements; (3) it closely matches torque-minimal motions; (4) it has no singularities when representing rotations; (5) it easily handles measurements from multiple sensors arriving at different times when timestamps are available; and (6) it deals with rolling shutter cameras naturally. We apply this continuous-time framework to visual-inertial simultaneous localization and mapping and show that it can also be used to calibrate the entire system.

Journal ArticleDOI
TL;DR: In this paper, a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought, and the system learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image.
Abstract: We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought. For example, perusing image results for a query "black shoes", the user might state, "Show me shoe images like these, but sportier." Offline, our approach first learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image (e.g., sportiness). At query time, the system presents the user with a set of exemplar images, and the user relates them to his/her target image with comparative statements. Using a series of such constraints in the multi-dimensional attribute space, our method iteratively updates its relevance function and re-ranks the database of images. To determine which exemplar images receive feedback from the user, we present two variants of the approach: one where the feedback is user-initiated and another where the feedback is actively system-initiated. In either case, our approach allows a user to efficiently "whittle away" irrelevant portions of the visual feature space, using semantic language to precisely communicate his/her preferences to the system. We demonstrate our technique for refining image search for people, products, and scenes, and we show that it outperforms traditional binary relevance feedback in terms of search speed and accuracy. In addition, the ordinal nature of relative attributes helps make our active approach efficient: both computationally for the machine when selecting the reference images, and for the user by requiring less user interaction than conventional passive and active methods.
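One way to apply a round of comparative feedback: score every database image by how many of the accumulated attribute constraints it satisfies, then re-rank. A minimal sketch over hypothetical predicted attribute strengths (the output of the learned ranking functions):

import numpy as np

def rerank(attr_strengths, feedback):
    # attr_strengths: (n_images, n_attributes) predicted strengths
    # feedback: list of (reference_image, attribute, 'more' | 'less')
    satisfied = np.zeros(attr_strengths.shape[0], dtype=int)
    for ref, a, direction in feedback:
        if direction == 'more':    # "like image ref, but more of attribute a"
            satisfied += attr_strengths[:, a] > attr_strengths[ref, a]
        else:                      # "... but less of attribute a"
            satisfied += attr_strengths[:, a] < attr_strengths[ref, a]
    return np.argsort(-satisfied)  # most-constraint-satisfying images first

strengths = np.random.rand(100, 3)             # e.g. sportiness is column 0
print(rerank(strengths, [(5, 0, 'more'), (12, 2, 'less')])[:10])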

Journal ArticleDOI
TL;DR: By using object detectors as voters to generate object confidence saliency maps, the proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention.
Abstract: The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape. Then, one may proceed to the classification by examining the corresponding regions of the alignments. More specifically, the alignments are used to transfer part annotations from training images to unseen images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We further argue that for the distinction of sub-classes, distribution-based features like color Fisher vectors are better suited for describing localized appearance of fine-grained categories than popular matching oriented shape-sensitive features, like HOG. They allow capturing the subtle local differences between subclasses, while at the same time being robust to misalignments between distinctive details. We evaluate the local alignments on the CUB-2011 and on the Stanford Dogs datasets, composed of 200 and 120, visually very hard to distinguish bird and dog species. In our experiments we study and show the benefit of the color Fisher vector parameterization, the influence of the alignment partitioning, and the significance of object segmentation on fine-grained categorization. We, furthermore, show that by using object detectors as voters to generate object confidence saliency maps, we arrive at fully unsupervised, yet highly accurate fine-grained categorization. The proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention. What is more, the local alignments reveal what appearance details are most decisive per fine-grained object category.

Journal ArticleDOI
TL;DR: In this article, an object class is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces, in order to infer and exploit object-object interactions.
Abstract: Current approaches to semantic image and scene understanding typically employ rather simple object representations such as 2D or 3D bounding boxes. While such coarse models are robust and allow for reliable object detection, they discard much of the information about objects' 3D shape and pose, and thus do not lend themselves well to higher-level reasoning. Here, we propose to base scene understanding on a high-resolution object representation. An object class--in our case cars--is modeled as a deformable 3D wireframe, which enables fine-grained modeling at the level of individual vertices and faces. We augment that model to explicitly include vertex-level occlusion, and embed all instances in a common coordinate frame, in order to infer and exploit object-object interactions. Specifically, from a single view we jointly estimate the shapes and poses of multiple objects in a common 3D frame. A ground plane in that frame is estimated by consensus among different objects, which significantly stabilizes monocular 3D pose estimation. The fine-grained model, in conjunction with the explicit 3D scene model, further allows one to infer part-level occlusions between the modeled objects, as well as occlusions by other, unmodeled scene elements. To demonstrate the benefits of such detailed object class models in the context of scene understanding we systematically evaluate our approach on the challenging KITTI street scene dataset. The experiments show that the model's ability to utilize image evidence at the level of individual parts improves monocular 3D pose estimation w.r.t. both location and (continuous) viewpoint.

Journal ArticleDOI
TL;DR: Comparison of theoretical properties and empirical performances of each blur approximation suggests that the proposed general model is preferable for approximation and inversion of a known shift-variant blur.
Abstract: Image deblurring is essential in high resolution imaging, e.g., astronomy, microscopy or computational photography. Shift-invariant blur is fully characterized by a single point-spread-function (PSF). Blurring is then modeled by a convolution, leading to efficient algorithms for blur simulation and removal that rely on fast Fourier transforms. However, in many different contexts, blur cannot be considered constant throughout the field-of-view, and thus necessitates modeling variations of the PSF with location. These models must achieve a trade-off between the accuracy afforded by their flexibility and their computational efficiency. Several fast approximations of blur have been proposed in the literature. We give a unified presentation of these methods in the light of matrix decompositions of the blurring operator. We establish the connection between different computational tricks that can be found in the literature and the physical sense of the corresponding approximations in terms of equivalent PSFs, physically-based approximations being preferable. We derive an improved approximation that preserves the same desirable low complexity as other fast algorithms while reaching a minimal approximation error. Comparison of theoretical properties and empirical performances of each blur approximation suggests that the proposed general model is preferable for approximation and inversion of a known shift-variant blur.
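The family of fast approximations can be illustrated by its simplest member: window the image with smooth masks that sum to one and convolve each windowed piece with its local PSF, so the shift-variant operator becomes a short sum of FFT convolutions. A minimal sketch:

import numpy as np
from scipy.signal import fftconvolve

def piecewise_blur(img, psfs, masks):
    # masks: smooth weight maps over the field of view, summing to 1 per pixel
    out = np.zeros_like(img)
    for psf, mask in zip(psfs, masks):
        out += fftconvolve(img * mask, psf, mode='same')
    return out

img = np.random.rand(64, 64)
left = np.linspace(1, 0, 64)[None, :] * np.ones((64, 1))  # smooth 2-mask split
masks = [left, 1 - left]
psfs = [np.ones((3, 3)) / 9, np.ones((7, 7)) / 49]        # two local PSFs
print(piecewise_blur(img, psfs, masks).shape)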

Journal ArticleDOI
TL;DR: In order to detect unsafe objects in a static scene, the method further infers hidden and situated causes (disturbances) in the scene, and then introduces intuitive physical mechanics to predict possible effects (e.g., falls) as consequences of the disturbances.
Abstract: This paper presents a new perspective for 3D scene understanding by reasoning about object stability and safety using intuitive mechanics. Our approach utilizes a simple observation that, by human design, objects in static scenes should be stable in the gravity field and be safe with respect to various physical disturbances such as human activities. This assumption is applicable to all scene categories and poses useful constraints for the plausible interpretations (parses) in scene understanding. Given a 3D point cloud captured for a static scene by depth cameras, our method consists of three steps: (i) recovering solid 3D volumetric primitives from voxels; (ii) reasoning about stability by grouping the unstable primitives into physically stable objects by optimizing the stability and the scene prior; and (iii) reasoning about safety by evaluating the physical risks for objects under physical disturbances, such as human activity, wind or earthquakes. We adopt a novel intuitive physics model and represent the energy landscape of each primitive and object in the scene by a disconnectivity graph (DG). We construct a contact graph with nodes being 3D volumetric primitives and edges representing the supporting relations. Then we adopt the Swendsen-Wang Cuts algorithm to partition the contact graph into groups, each of which is a stable object. In order to detect unsafe objects in a static scene, our method further infers hidden and situated causes (disturbances) in the scene, and then introduces intuitive physical mechanics to predict possible effects (e.g., falls) as consequences of the disturbances. In experiments, we demonstrate that the algorithm achieves substantially better performance for (i) object segmentation, (ii) 3D volumetric recovery, and (iii) scene understanding with respect to other state-of-the-art methods. We also compare the safety prediction from the intuitive mechanics model with human judgement.

Journal ArticleDOI
TL;DR: In this paper, the authors propose to discover and learn the visual appearance of attributes automatically, using a recently introduced database, called AVA, which contains more than 250,000 images together with their aesthetic scores and textual comments given by photography enthusiasts.
Abstract: Aesthetic image analysis is the study and assessment of the aesthetic properties of images. Current computational approaches to aesthetic image analysis provide either accurate or interpretable results. To obtain both accuracy and interpretability by humans, we advocate the use of learned and nameable visual attributes as mid-level features. For this purpose, we propose to discover and learn the visual appearance of attributes automatically, using a recently introduced database, called AVA, which contains more than 250,000 images together with their aesthetic scores and textual comments given by photography enthusiasts. We provide a detailed analysis of these annotations as well as the context in which they were given. We then describe how these three key components of AVA (images, scores, and comments) can be effectively leveraged to learn visual attributes. Lastly, we show that these learned attributes can be successfully used in three applications: aesthetic quality prediction, image tagging and retrieval.

Journal ArticleDOI
TL;DR: Extensive experiments show that SPHORB consistently outperforms other existing spherical features in accuracy, efficiency and robustness to camera movements, and has been validated by real-world matching tests.
Abstract: In this paper, we propose SPHORB, a new fast and robust binary feature detector and descriptor for spherical panoramic images. In contrast to state-of-the-art spherical features, our approach stems from the geodesic grid, a nearly equal-area hexagonal grid parametrization of the sphere used in climate modeling. It enables us to directly build fine-grained pyramids and construct robust features on the hexagonal spherical grid, thus avoiding the costly computation of spherical harmonics and their associated bandwidth limitation. We further study how to achieve scale and rotation invariance for the proposed SPHORB feature. Extensive experiments show that SPHORB consistently outperforms other existing spherical features in accuracy, efficiency and robustness to camera movements. The superior performance of SPHORB has also been validated by real-world matching tests.

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the proposed tracking system can effectively tackle the difficulties caused by LFR, and an integral image based parameter calculation is constructed, which greatly reduces the computational load.
Abstract: Tracking in low frame rate (LFR) videos is one of the most important problems in the tracking literature. Most existing approaches treat LFR video tracking as an abrupt motion tracking problem. However, in LFR video tracking applications, LFR not only causes abrupt motions, but also large appearance changes of objects because the objects' poses and the illumination may undergo large changes from one frame to the next. This adds extra difficulties to LFR video tracking. In this paper, we propose a robust and general tracking system for LFR videos. The tracking system consists of four major parts: dominant color-spatial based object representation, bin-ratio based similarity measure, annealed particle swarm optimization (PSO) based searching, and an integral image based parameter calculation. The first two parts are combined to provide a good solution to the appearance changes, and the abrupt motion is effectively captured by the annealed PSO based searching. Moreover, an integral image of model parameters is constructed, which provides a look-up table for parameters calculation. This greatly reduces the computational load. Experimental results demonstrate that the proposed tracking system can effectively tackle the difficulties caused by LFR.
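The search component is particle swarm optimisation over candidate object states; the annealing of swarm parameters that lets the paper capture abrupt motion is omitted from this plain sketch, as are the color-spatial appearance model and integral-image speed-ups:

import numpy as np

def pso_search(fitness, n_particles=30, iters=50, bounds=(0.0, 100.0),
               w=0.7, c1=1.5, c2=1.5):
    lo, hi = bounds
    x = np.random.uniform(lo, hi, (n_particles, 2))   # candidate 2-D states
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    g = pbest[np.argmax(pbest_f)].copy()              # global best state
    for _ in range(iters):
        r1 = np.random.rand(n_particles, 1)
        r2 = np.random.rand(n_particles, 1)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        f = np.array([fitness(p) for p in x])
        better = f > pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmax(pbest_f)].copy()
    return g

target = np.array([63.0, 21.0])
print(pso_search(lambda p: -np.sum((p - target) ** 2)))   # -> near target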

Journal ArticleDOI
TL;DR: In this paper, the authors propose to embed Grassmann manifolds into the space of symmetric matrices by an isometric mapping, which enables them to learn a Grassmann dictionary, atom by atom.
Abstract: Sparsity-based representations have recently led to notable results in various visual recognition tasks. In a separate line of research, Riemannian manifolds have been shown useful for dealing with features and models that do not lie in Euclidean spaces. With the aim of building a bridge between the two realms, we address the problem of sparse coding and dictionary learning in Grassmann manifolds, i.e., the space of linear subspaces. To this end, we propose to embed Grassmann manifolds into the space of symmetric matrices by an isometric mapping. This in turn enables us to extend two sparse coding schemes to Grassmann manifolds. Furthermore, we propose an algorithm for learning a Grassmann dictionary, atom by atom. Lastly, to handle non-linearity in data, we extend the proposed Grassmann sparse coding and dictionary learning algorithms through embedding into higher dimensional Hilbert spaces. Experiments on several classification tasks (gender recognition, gesture classification, scene analysis, face recognition, action recognition and dynamic texture classification) show that the proposed approaches achieve considerable improvements in discrimination accuracy, in comparison to state-of-the-art methods such as kernelized Affine Hull Method and graph-embedding Grassmann discriminant analysis.
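The bridge between the two realms is the embedding itself: a subspace with orthonormal basis U maps to the symmetric projection matrix U U^T, and the Frobenius distance between such projections is, up to a constant factor, the projection metric on the Grassmannian. A minimal sketch:

import numpy as np

def grassmann_embed(U):
    # orthonormal basis (d x p) -> symmetric projection matrix (d x d)
    return U @ U.T

def projection_distance(U1, U2):
    # Frobenius distance between the embedded points
    return np.linalg.norm(grassmann_embed(U1) - grassmann_embed(U2))

# two 2-dimensional subspaces of R^5 from random bases
A = np.linalg.qr(np.random.randn(5, 2))[0]
B = np.linalg.qr(np.random.randn(5, 2))[0]
print(projection_distance(A, B))
print(projection_distance(A, A))    # 0: the embedding respects the geometry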