
Showing papers in "IEEE Transactions on Pattern Analysis and Machine Intelligence in 2011"


Journal ArticleDOI
TL;DR: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation and presents state-of-the-art algorithms for both of these tasks.
Abstract: This paper investigates two fundamental problems in computer vision: contour detection and image segmentation. We present state-of-the-art algorithms for both of these tasks. Our contour detector combines multiple local cues into a globalization framework based on spectral clustering. Our segmentation algorithm consists of generic machinery for transforming the output of any contour detector into a hierarchical region tree. In this manner, we reduce the problem of image segmentation to that of contour detection. Extensive experimental evaluation demonstrates that both our contour detection and segmentation methods significantly outperform competing algorithms. The automatically generated hierarchical segmentations can be interactively refined by user-specified annotations. Computation at multiple image resolutions provides a means of coupling our system to recognition applications.
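
The globalization step can be illustrated with a small spectral-clustering sketch. The following is purely illustrative, assuming a grayscale numpy image and a plain intensity-difference cue; it is not the authors' gPb-owt-ucm implementation, which combines multiple learned local cues.

```python
# Minimal sketch of globalizing local contour cues via spectral clustering
# (illustrative only; not the authors' gPb-owt-ucm system).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def spectral_contours(img, n_vecs=4, sigma=0.1):
    h, w = img.shape
    flat = img.ravel().astype(float)
    idx = np.arange(h * w).reshape(h, w)
    rows, cols, vals = [], [], []
    # Affinity between 4-connected neighbors: low where the local cue
    # (here a plain intensity difference) suggests a boundary.
    for i1, i2 in [(idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])]:
        i1, i2 = i1.ravel(), i2.ravel()
        a = np.exp(-np.abs(flat[i1] - flat[i2]) / sigma)
        rows += [i1, i2]; cols += [i2, i1]; vals += [a, a]
    W = sp.csr_matrix((np.concatenate(vals),
                       (np.concatenate(rows), np.concatenate(cols))),
                      shape=(h * w, h * w))
    d = np.asarray(W.sum(axis=1)).ravel()
    Dhalf = sp.diags(1.0 / np.sqrt(d))
    # Top eigenvectors of D^-1/2 W D^-1/2 correspond to the smallest
    # eigenvectors of the normalized graph Laplacian.
    _, vecs = eigsh(Dhalf @ W @ Dhalf, k=n_vecs + 1, which='LA')
    maps = (Dhalf @ vecs[:, :-1]).T.reshape(n_vecs, h, w)  # drop trivial eigvec
    gy, gx = np.gradient(maps, axis=1), np.gradient(maps, axis=2)
    return np.sqrt(gx**2 + gy**2).sum(axis=0)  # "globalized" boundary strength
```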

5,068 citations


Journal ArticleDOI
TL;DR: A simple but effective image prior, the dark channel prior, is proposed to remove haze from a single input image, based on a key observation: most local patches in haze-free outdoor images contain some pixels which have very low intensities in at least one color channel.
Abstract: In this paper, we propose a simple but effective image prior, the dark channel prior, to remove haze from a single input image. The dark channel prior is a kind of statistic of outdoor haze-free images. It is based on a key observation: most local patches in outdoor haze-free images contain some pixels whose intensity is very low in at least one color channel. Using this prior with the haze imaging model, we can directly estimate the thickness of the haze and recover a high-quality haze-free image. Results on a variety of hazy images demonstrate the power of the proposed prior. Moreover, a high-quality depth map can also be obtained as a byproduct of haze removal.
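
The prior translates almost directly into code. Below is a minimal sketch of single-image dehazing with the dark channel prior; the patch size, omega, and the top-0.1% rule for atmospheric light follow common choices and are assumptions, not necessarily the paper's exact settings.

```python
# Minimal sketch of dehazing with the dark channel prior (illustrative).
import numpy as np
from scipy.ndimage import minimum_filter

def dehaze(I, patch=15, omega=0.95, t0=0.1):
    """I: HxWx3 float image in [0, 1]."""
    # Dark channel: per-pixel min over color channels, then a local min filter.
    dark = minimum_filter(I.min(axis=2), size=patch)
    # Atmospheric light A: mean color of the brightest 0.1% dark-channel pixels.
    flat = dark.ravel()
    top = np.argsort(flat)[-max(1, flat.size // 1000):]
    A = I.reshape(-1, 3)[top].mean(axis=0)
    # Transmission from the dark channel of the A-normalized image.
    t = 1.0 - omega * minimum_filter((I / A).min(axis=2), size=patch)
    t = np.clip(t, t0, 1.0)[..., None]
    # Invert the haze imaging model I = J*t + A*(1 - t) to recover radiance J.
    return np.clip((I - A) / t + A, 0.0, 1.0)
```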

3,668 citations


Journal ArticleDOI
TL;DR: This paper introduces a product quantization-based approach for approximate nearest neighbor search to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately.
Abstract: This paper introduces a product quantization-based approach for approximate nearest neighbor search. The idea is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. A vector is represented by a short code composed of its subspace quantization indices. The Euclidean distance between two vectors can be efficiently estimated from their codes. An asymmetric version increases precision, as it computes the approximate distance between a vector and a code. Experimental results show that our approach searches for nearest neighbors efficiently, in particular in combination with an inverted file system. Results for SIFT and GIST image descriptors show excellent search accuracy, outperforming three state-of-the-art approaches. The scalability of our approach is validated on a data set of two billion vectors.
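
A compact sketch of the idea, with arbitrary choices of m = 8 subspaces and k = 256 centroids (one byte per subindex); the paper's inverted-file machinery is omitted.

```python
# Minimal sketch of product quantization with asymmetric distance computation.
import numpy as np
from scipy.cluster.vq import kmeans2

def pq_train(X, m=8, k=256):
    """Split the D dims into m subspaces (D divisible by m) and learn
    k centroids per subspace."""
    subs = np.split(X, m, axis=1)
    return [kmeans2(s, k, minit='++')[0] for s in subs]

def pq_encode(X, codebooks):
    """Each vector becomes m one-byte subspace centroid indices."""
    subs = np.split(X, len(codebooks), axis=1)
    return np.stack([np.argmin(((s[:, None] - C[None]) ** 2).sum(-1), axis=1)
                     for s, C in zip(subs, codebooks)], axis=1).astype(np.uint8)

def pq_asymmetric_dist(q, codes, codebooks):
    """Distance from an uncompressed query to all encoded vectors,
    via one k-entry lookup table per subspace."""
    qs = np.split(q, len(codebooks))
    tables = [((C - s) ** 2).sum(1) for s, C in zip(qs, codebooks)]
    return sum(t[codes[:, j]] for j, t in enumerate(tables))
```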

2,559 citations


Journal ArticleDOI
TL;DR: A set of novel features, including multiscale contrast, center-surround histogram, and color spatial distribution, are proposed to describe a salient object locally, regionally, and globally.
Abstract: In this paper, we study the salient object detection problem for images. We formulate this problem as a binary labeling task where we separate the salient object from the background. We propose a set of novel features, including multiscale contrast, center-surround histogram, and color spatial distribution, to describe a salient object locally, regionally, and globally. A conditional random field is learned to effectively combine these features for salient object detection. Further, we extend the proposed approach to detect a salient object from sequential images by introducing dynamic salient features. We collected a large image database containing tens of thousands of images carefully labeled by multiple users, as well as a video segment database, and conducted a set of experiments over them to demonstrate the effectiveness of the proposed approach.
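
The first of these features can be sketched compactly. The version below is illustrative: it computes contrast against Gaussian-smoothed neighborhoods at a few assumed scales, leaving out the center-surround histogram, the color-spatial feature, and the CRF that fuses them.

```python
# Minimal sketch of a multiscale contrast feature (illustrative; scales assumed).
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_contrast(img, sigmas=(2, 4, 8)):
    """img: 2D grayscale array; returns a per-pixel feature map whose large
    values indicate pixels that stand out from their surroundings."""
    img = img.astype(float)
    out = np.zeros_like(img)
    for s in sigmas:
        local_mean = gaussian_filter(img, sigma=s)
        out += (img - local_mean) ** 2   # contrast to an s-scale neighborhood
    return out
```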

2,319 citations


Journal ArticleDOI
TL;DR: It is shown that using Multiple Instance Learning (MIL) instead of traditional supervised learning avoids these problems and can therefore lead to a more robust tracker with fewer parameter tweaks.
Abstract: In this paper, we address the problem of tracking an object in a video given its location in the first frame and no other information. Recently, a class of tracking techniques called “tracking by detection” has been shown to give promising results at real-time speeds. These methods train a discriminative classifier in an online manner to separate the object from the background. This classifier bootstraps itself by using the current tracker state to extract positive and negative examples from the current frame. Slight inaccuracies in the tracker can therefore lead to incorrectly labeled training examples, which degrade the classifier and can cause drift. In this paper, we show that using Multiple Instance Learning (MIL) instead of traditional supervised learning avoids these problems and can therefore lead to a more robust tracker with fewer parameter tweaks. We propose a novel online MIL algorithm for object tracking that achieves superior results with real-time performance. We present thorough experimental results (both qualitative and quantitative) on a number of challenging video clips.
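
The key MIL ingredient is that a positive bag needs only one good instance, commonly modeled with a noisy-OR. A minimal sketch, assuming a plain logistic instance classifier; the paper's online boosting machinery is omitted.

```python
# Minimal sketch of the Multiple Instance Learning ingredient in MIL tracking:
# bags are scored with a noisy-OR over instance probabilities (illustrative).
import numpy as np

def instance_prob(w, x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))       # sigmoid instance classifier

def bag_prob(w, bag):
    p = instance_prob(w, bag)                   # bag: n_instances x n_features
    return 1.0 - np.prod(1.0 - p)               # noisy-OR: one good instance suffices

def bag_log_likelihood(w, bags, labels):
    # A positive bag (patches cropped near the tracker) only needs one
    # well-aligned instance, so slightly inaccurate crops no longer
    # poison the classifier with mislabeled examples.
    return sum(y * np.log(bag_prob(w, b) + 1e-12)
               + (1 - y) * np.log(1.0 - bag_prob(w, b) + 1e-12)
               for b, y in zip(bags, labels))
```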

2,101 citations


Journal ArticleDOI
TL;DR: In GNMF, an affinity graph is constructed to encode the geometrical information and a matrix factorization is sought, which respects the graph structure, and the empirical study shows encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-world problems.
Abstract: Matrix factorization techniques have been frequently applied in information retrieval, computer vision, and pattern recognition. Among them, Nonnegative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from the geometric perspective, the data is usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space. One then hopes to find a compact representation, which uncovers the hidden semantics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Graph Regularized Nonnegative Matrix Factorization (GNMF), for this purpose. In GNMF, an affinity graph is constructed to encode the geometrical information and we seek a matrix factorization which respects the graph structure. Our empirical study shows encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on real-world problems.
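
The factorization is typically computed with multiplicative updates. A minimal sketch for X ≈ UVᵀ with the graph penalty λ·Tr(VᵀLV), where L = D − W on the data points; the affinity construction and λ are assumptions here, not necessarily the paper's choices.

```python
# Minimal sketch of GNMF multiplicative updates (illustrative).
import numpy as np

def gnmf(X, W, k=10, lam=1.0, iters=200, eps=1e-9):
    """X: features x samples (nonnegative); W: samples x samples affinity."""
    m, n = X.shape
    D = np.diag(W.sum(axis=1))                   # degree matrix of the graph
    rng = np.random.default_rng(0)
    U = rng.random((m, k))
    V = rng.random((n, k))
    for _ in range(iters):
        # Standard NMF update for the basis U.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # Graph-regularized update for the coefficients V: the lam*W@V term
        # pulls neighboring points on the graph toward similar codes.
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```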

1,870 citations


Journal ArticleDOI
TL;DR: SIFT flow is proposed, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence.
Abstract: While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow, where an image is aligned to its temporally adjacent frame, we propose SIFT flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixelwise SIFT features between two images while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.

1,726 citations


Journal ArticleDOI
TL;DR: A way to approach dense optical flow estimation by integrating rich descriptors into the variational optical flow setting is presented, reaching out to new domains of motion analysis where the requirement of dense sampling in time is no longer satisfied.
Abstract: Optical flow estimation is classically marked by the requirement of dense sampling in time. While coarse-to-fine warping schemes have somehow relaxed this constraint, there is an inherent dependency between the scale of structures and the velocity that can be estimated. This particularly renders the estimation of detailed human motion problematic, as small body parts can move very fast. In this paper, we present a way to approach this problem by integrating rich descriptors into the variational optical flow setting. This way we can estimate a dense optical flow field with almost the same high accuracy as known from variational optical flow, while reaching out to new domains of motion analysis where the requirement of dense sampling in time is no longer satisfied.

1,429 citations


Journal ArticleDOI
TL;DR: This paper shows that reformulating that step as a constrained flow optimization results in a convex problem and takes advantage of its particular structure to solve it using the k-shortest paths algorithm, which is very fast.
Abstract: Multi-object tracking can be achieved by detecting objects in individual frames and then linking detections across frames. Such an approach can be made very robust to the occasional detection failure: If an object is not detected in a frame but is in previous and following ones, a correct trajectory will nevertheless be produced. By contrast, a false-positive detection in a few frames will be ignored. However, when dealing with a multiple target problem, the linking step results in a difficult optimization problem in the space of all possible families of trajectories. This is usually dealt with by sampling or greedy search based on variants of Dynamic Programming which can easily miss the global optimum. In this paper, we show that reformulating that step as a constrained flow optimization results in a convex problem. We take advantage of its particular structure to solve it using the k-shortest paths algorithm, which is very fast. This new approach is far simpler formally and algorithmically than existing techniques and lets us demonstrate excellent performance in two very different contexts.
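
The linking step can be sketched as a min-cost network flow. The version below is illustrative and leans on networkx's generic solver rather than the paper's k-shortest paths algorithm; `transition_cost` and `det_cost` are assumed user-supplied functions returning integer (scaled) costs, negative for confident detections.

```python
# Minimal sketch of linking frame-by-frame detections as a flow problem
# (illustrative; the paper exploits the structure with k-shortest paths).
import networkx as nx

def link_detections(detections, transition_cost, det_cost, n_tracks):
    """detections: list (per frame) of lists of detection ids."""
    G = nx.DiGraph()
    G.add_node('S', demand=-n_tracks)            # n_tracks units of flow
    G.add_node('T', demand=n_tracks)
    for t, frame in enumerate(detections):
        for d in frame:
            # Detection edge: using this detection pays its (negative) cost.
            G.add_edge(('in', t, d), ('out', t, d),
                       capacity=1, weight=det_cost(t, d))
            G.add_edge('S', ('in', t, d), capacity=1, weight=0)   # track birth
            G.add_edge(('out', t, d), 'T', capacity=1, weight=0)  # track death
            if t + 1 < len(detections):
                for d2 in detections[t + 1]:
                    G.add_edge(('out', t, d), ('in', t + 1, d2),
                               capacity=1, weight=transition_cost(t, d, d2))
    # Globally optimal, integral solution of the (convex) linking problem.
    return nx.min_cost_flow(G)
```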

1,076 citations


Journal ArticleDOI
Xue Mei, Haibin Ling
TL;DR: This paper proposes a robust visual tracking method by casting tracking as a sparse approximation problem in a particle filter framework and extends the method for simultaneous tracking and recognition by introducing a static template set which stores target images from different classes.
Abstract: In this paper, we propose a robust visual tracking method by casting tracking as a sparse approximation problem in a particle filter framework. In this framework, occlusion, noise, and other challenging issues are addressed seamlessly through a set of trivial templates. Specifically, to find the tracking target in a new frame, each target candidate is sparsely represented in the space spanned by target templates and trivial templates. The sparsity is achieved by solving an l1-regularized least-squares problem. Then, the candidate with the smallest projection error is taken as the tracking target. After that, tracking is continued using a Bayesian state inference framework. Two strategies are used to further improve the tracking performance. First, target templates are dynamically updated to capture appearance changes. Second, nonnegativity constraints are enforced to filter out clutter which negatively resembles tracking targets. We test the proposed approach on numerous sequences involving different types of challenges, including occlusion and variations in illumination, scale, and pose. The proposed approach demonstrates excellent performance in comparison with previously proposed trackers. We also extend the method for simultaneous tracking and recognition by introducing a static template set which stores target images from different classes. The recognition result at each frame is propagated to produce the final result for the whole video. The approach is validated on a vehicle tracking and classification task using outdoor infrared video sequences.
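
The per-candidate scoring step can be sketched with an off-the-shelf l1 solver. This is illustrative: the particle filter, template updates, and the paper's exact solver are omitted, and sklearn's Lasso objective differs from the paper's formulation by a scaling of the penalty.

```python
# Minimal sketch of scoring one tracking candidate by sparse representation
# over target templates plus trivial templates (illustrative).
import numpy as np
from sklearn.linear_model import Lasso

def candidate_error(y, T, lam=0.01):
    """y: d-dim candidate patch (vectorized); T: d x n target templates."""
    d = y.shape[0]
    # Trivial templates +I and -I absorb occlusion and pixel noise,
    # while coefficients stay nonnegative.
    A = np.hstack([T, np.eye(d), -np.eye(d)])
    model = Lasso(alpha=lam, positive=True, fit_intercept=False, max_iter=5000)
    c = model.fit(A, y).coef_
    a = c[:T.shape[1]]                       # coefficients on target templates
    return np.linalg.norm(y - T @ a) ** 2    # smallest error wins the frame
```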

911 citations


Journal ArticleDOI
TL;DR: This paper presents a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers, and shows that the popular iterative closest point (ICP) method and several existing point setRegistration methods in the field are closely related and can be reinterpreted meaningfully in this general framework.
Abstract: In this paper, we present a unified framework for the rigid and nonrigid point set registration problem in the presence of significant amounts of noise and outliers. The key idea of this registration framework is to represent the input point sets using Gaussian mixture models. Then, the problem of point set registration is reformulated as the problem of aligning two Gaussian mixtures such that a statistical discrepancy measure between the two corresponding mixtures is minimized. We show that the popular iterative closest point (ICP) method and several existing point set registration methods in the field are closely related and can be reinterpreted meaningfully in our general framework. Our instantiation of this general framework is based on the L2 distance between two Gaussian mixtures, which has a closed-form expression and in turn leads to a computationally efficient registration algorithm. The resulting registration algorithm exhibits inherent statistical robustness, has an intuitive interpretation, and is simple to implement. We also provide theoretical and experimental comparisons with other robust methods for point set registration.
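
The closed-form L2 distance is easy to reproduce for isotropic, equal-weight mixtures. A minimal sketch; the transformation model and the optimizer that actually drive registration are omitted.

```python
# Minimal sketch of the closed-form L2 distance between two isotropic
# Gaussian mixtures, the discrepancy minimized during registration.
import numpy as np

def gauss_inner(mu1, s1, mu2, s2):
    """Pairwise integrals: int N(x; m1, s1^2 I) N(x; m2, s2^2 I) dx
    = N(m1; m2, (s1^2 + s2^2) I)."""
    d = mu1.shape[-1]
    var = s1**2 + s2**2
    sq = ((mu1[:, None, :] - mu2[None, :, :]) ** 2).sum(-1)
    return (2 * np.pi * var) ** (-d / 2) * np.exp(-sq / (2 * var))

def gmm_l2_dist2(mu1, mu2, s=1.0):
    """Equal-weight mixtures centered at mu1, mu2 with common bandwidth s:
    int (f - g)^2 = int f^2 - 2 int f g + int g^2, all in closed form."""
    w1, w2 = 1.0 / len(mu1), 1.0 / len(mu2)
    ff = w1 * w1 * gauss_inner(mu1, s, mu1, s).sum()
    gg = w2 * w2 * gauss_inner(mu2, s, mu2, s).sum()
    fg = w1 * w2 * gauss_inner(mu1, s, mu2, s).sum()
    return ff + gg - 2 * fg
```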

Journal ArticleDOI
TL;DR: CENsus TRansform hISTogram (CENTRIST), a new visual descriptor for recognizing topological places or scene categories, is introduced; it is a holistic representation with strong generalizability for category recognition.
Abstract: CENsus TRansform hISTogram (CENTRIST), a new visual descriptor for recognizing topological places or scene categories, is introduced in this paper. We show that place and scene recognition, especially for indoor environments, requires its visual descriptor to possess properties that are different from other vision domains (e.g., object recognition). CENTRIST satisfies these properties and suits the place and scene recognition task. It is a holistic representation and has strong generalizability for category recognition. CENTRIST mainly encodes the structural properties within an image and suppresses detailed textural information. Our experiments demonstrate that CENTRIST outperforms the current state of the art in several place and scene recognition data sets, compared with other descriptors such as SIFT and Gist. Moreover, it is easy to implement and extremely fast to evaluate.
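
The census transform itself is a few lines of numpy. A minimal sketch of the base descriptor; the paper's spatial arrangements and PCA variant are omitted.

```python
# Minimal sketch of CENTRIST: census-transform each pixel against its eight
# neighbors, then histogram the resulting 8-bit codes (illustrative).
import numpy as np

def centrist(img):
    """img: 2D grayscale array; returns a 256-bin descriptor."""
    H, W = img.shape
    c = img[1:-1, 1:-1]                          # interior (center) pixels
    code = np.zeros_like(c, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        code |= ((nb <= c).astype(np.uint8) << bit)  # compare neighbor, center
    return np.bincount(code.ravel(), minlength=256)
```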

Journal ArticleDOI
TL;DR: A new method for the detection of P300 waves is presented, based on a convolutional neural network (CNN), which provides a new way for analyzing brain activities due to the receptive field of the CNN models.
Abstract: A Brain-Computer Interface (BCI) is a specific type of human-computer interface that enables direct communication between humans and computers by analyzing brain measurements. Oddball paradigms are used in BCI to generate event-related potentials (ERPs), like the P300 wave, on targets selected by the user. A P300 speller is based on this principle, where the detection of P300 waves allows the user to write characters. The P300 speller is composed of two classification problems. The first classification is to detect the presence of a P300 in the electroencephalogram (EEG). The second one corresponds to the combination of different P300 responses for determining the right character to spell. A new method for the detection of P300 waves is presented. This model is based on a convolutional neural network (CNN). The topology of the network is adapted to the detection of P300 waves in the time domain. Seven classifiers based on the CNN are proposed: four single classifiers with different feature sets and three multiclassifiers. These models are tested and compared on Data set II of the third BCI competition. The best result is obtained with a multiclassifier solution with a recognition rate of 95.5 percent, without channel selection before the classification. The proposed approach also provides a new way of analyzing brain activities due to the receptive fields of the CNN models.
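
The described topology, spatial filtering across electrodes followed by temporal convolution, can be sketched in PyTorch. The layer widths below are assumptions for illustration, not the paper's exact network.

```python
# Minimal sketch of a CNN for P300 detection: spatial convolution across EEG
# channels, then temporal convolution/subsampling (illustrative sizes).
import torch
import torch.nn as nn

class P300Net(nn.Module):
    def __init__(self, n_channels=64, n_samples=78, n_spatial=10, n_temporal=13):
        super().__init__()
        self.net = nn.Sequential(
            # Learn spatial filters: mix all electrodes at each time step.
            nn.Conv2d(1, n_spatial, kernel_size=(n_channels, 1)),
            nn.Sigmoid(),
            # Temporal convolution with matching stride (subsampling in time).
            nn.Conv2d(n_spatial, 5 * n_spatial, kernel_size=(1, n_temporal),
                      stride=(1, n_temporal)),
            nn.Sigmoid(),
            nn.Flatten(),
            nn.Linear(5 * n_spatial * (n_samples // n_temporal), 100),
            nn.Sigmoid(),
            nn.Linear(100, 2),        # P300 present / absent
        )

    def forward(self, x):             # x: (batch, 1, n_channels, n_samples)
        return self.net(x)
```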

Journal ArticleDOI
TL;DR: This paper proposes a novel approach for multiperson tracking-by-detection in a particle filtering framework that detects and tracks a large number of dynamically moving people in complex scenes with occlusions, requires no camera or ground plane calibration, and only makes use of information from the past.
Abstract: In this paper, we address the problem of automatically detecting and tracking a variable number of persons in complex scenes using a monocular, potentially moving, uncalibrated camera. We propose a novel approach for multiperson tracking-by-detection in a particle filtering framework. In addition to final high-confidence detections, our algorithm uses the continuous confidence of pedestrian detectors and online-trained, instance-specific classifiers as a graded observation model. Thus, generic object category knowledge is complemented by instance-specific information. The main contribution of this paper is to explore how these unreliable information sources can be used for robust multiperson tracking. The algorithm detects and tracks a large number of dynamically moving people in complex scenes with occlusions, does not rely on background modeling, requires no camera or ground plane calibration, and only makes use of information from the past. Hence, it imposes very few restrictions and is suitable for online applications. Our experiments show that the method yields good tracking performance in a large variety of highly dynamic scenarios, such as typical surveillance videos, webcam footage, or sports sequences. We demonstrate that our algorithm outperforms other methods that rely on additional information. Furthermore, we analyze the influence of different algorithm components on the robustness.

Journal ArticleDOI
TL;DR: This paper introduces a square-root velocity (SRV) representation for analyzing shapes of curves in Euclidean spaces under an elastic metric and demonstrates a wrapped probability distribution for capturing shapes of planar closed curves.
Abstract: This paper introduces a square-root velocity (SRV) representation for analyzing shapes of curves in Euclidean spaces under an elastic metric. In this SRV representation, the elastic metric simplifies to the L2 metric, the reparameterization group acts by isometries, and the space of unit length curves becomes the unit sphere. The shape space of closed curves is the quotient space of (a submanifold of) the unit sphere, modulo rotation and reparameterization groups, and we find geodesics in that space using a path straightening approach. These geodesics and geodesic distances provide a framework for optimally matching, deforming, and comparing shapes. These ideas are demonstrated using: 1) shape analysis of cylindrical helices for studying protein structure, 2) shape analysis of facial curves for recognizing faces, 3) a wrapped probability distribution for capturing shapes of planar closed curves, and 4) parallel transport of deformations for predicting shapes from novel poses.
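
The representation itself is one line of arithmetic per sample point. A minimal sketch assuming a uniformly sampled open curve; shape-space geodesics, the closure constraint, and optimization over reparameterizations are omitted.

```python
# Minimal sketch of the square-root velocity (SRV) representation.
import numpy as np

def srv(curve, eps=1e-12):
    """curve: n x d array of points sampled along the curve."""
    v = np.gradient(curve, axis=0)                  # velocity along the curve
    speed = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.sqrt(speed + eps)                 # q(t) = v(t) / sqrt(|v(t)|)

def srv_distance(c1, c2):
    """Under the SRV map the elastic metric becomes the plain L2 metric,
    so a (pre-shape) distance is just an L2 norm between q-functions."""
    q1, q2 = srv(c1), srv(c2)
    return np.sqrt(((q1 - q2) ** 2).sum() / len(q1))
```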

Journal ArticleDOI
TL;DR: The proposed sparse correntropy framework is more robust and efficient in dealing with the occlusion and corruption problems in face recognition as compared to the related state-of-the-art methods and the computational cost is much lower than the SRC algorithms.
Abstract: In this paper, we present a sparse correntropy framework for computing robust sparse representations of face images for recognition. Compared with the state-of-the-art l1-norm-based sparse representation classifier (SRC), which assumes that noise also has a sparse representation, our sparse algorithm is developed based on the maximum correntropy criterion, which is much more insensitive to outliers. In order to develop a more tractable and practical approach, we impose a nonnegativity constraint on the variables in the maximum correntropy criterion and develop a half-quadratic optimization technique to approximately maximize the objective function in an alternating way, so that the complex optimization problem is reduced to learning a sparse representation through a weighted linear least squares problem with a nonnegativity constraint at each iteration. Our extensive experiments demonstrate that the proposed method is more robust and efficient in dealing with the occlusion and corruption problems in face recognition as compared to the related state-of-the-art methods. In particular, the proposed method improves both recognition accuracy and receiver operating characteristic (ROC) curves, while its computational cost is much lower than that of the SRC algorithms.
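
The half-quadratic scheme amounts to iteratively reweighted nonnegative least squares with Gaussian weights. A minimal sketch under that reading; the dictionary setup and any explicit sparsity handling are simplified relative to the paper.

```python
# Minimal sketch of half-quadratic optimization of a correntropy objective:
# alternate between Gaussian reweighting and a weighted NNLS subproblem.
import numpy as np
from scipy.optimize import nnls

def correntropy_code(A, y, sigma=0.5, iters=10):
    """A: d x n dictionary of training faces; y: d-dim test image.
    Returns a nonnegative code x that is robust to outlier pixels."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = y - A @ x
        w = np.exp(-r**2 / (2 * sigma**2))   # small weight = likely outlier pixel
        sw = np.sqrt(w)
        # Weighted linear least squares with nonnegativity constraint.
        x, _ = nnls(A * sw[:, None], y * sw)
    return x
```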

Journal ArticleDOI
TL;DR: Hough forests can be regarded as task-adapted codebooks of local appearance that allow fast supervised training and fast matching at test time; they improve the performance of the generalized Hough transform for object detection on a categorical level and extend it to new domains such as object tracking and action recognition.
Abstract: The paper introduces Hough forests, which are random forests adapted to perform a generalized Hough transform in an efficient way. Compared to previous Hough-based systems such as implicit shape models, Hough forests improve the performance of the generalized Hough transform for object detection on a categorical level. At the same time, their flexibility permits extensions of the Hough transform to new domains such as object tracking and action recognition. Hough forests can be regarded as task-adapted codebooks of local appearance that allow fast supervised training and fast matching at test time. They achieve high detection accuracy since the entries of such codebooks are optimized to cast Hough votes with small variance and since their efficiency permits dense sampling of local image patches or video cuboids during detection. The efficacy of Hough forests for a set of computer vision tasks is validated through experiments on a large set of publicly available benchmark data sets and comparisons with the state-of-the-art.
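
The voting step at test time can be sketched independently of forest training. Illustrative only; `leaf_lookup` is a hypothetical stand-in for dropping a patch down the trees and reading the displacement vectors stored at the reached leaves.

```python
# Minimal sketch of Hough voting: each patch's matched leaf casts its stored
# offset votes into an accumulator whose peaks hypothesize object centers.
import numpy as np

def hough_votes(shape, patch_positions, leaf_lookup):
    """leaf_lookup(pos) -> (offsets: k x 2 array, weight) at the matched leaf."""
    acc = np.zeros(shape)
    for (y, x) in patch_positions:
        offsets, weight = leaf_lookup((y, x))
        for dy, dx in offsets:
            cy, cx = y + dy, x + dx
            if 0 <= cy < shape[0] and 0 <= cx < shape[1]:
                # Spread the leaf's weight over its stored votes.
                acc[cy, cx] += weight / max(len(offsets), 1)
    return acc   # smooth and take local maxima as detections
```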

Journal ArticleDOI
TL;DR: The work demonstrates the promise of eye-based activity recognition (EAR) and opens up discussion on the wider applicability of EAR to other activities that are difficult, or even impossible, to detect using common sensing modalities.
Abstract: In this work, we investigate eye movement analysis as a new sensing modality for activity recognition. Eye movement data were recorded using an electrooculography (EOG) system. We first describe and evaluate algorithms for detecting three eye movement characteristics from EOG signals-saccades, fixations, and blinks-and propose a method for assessing repetitive patterns of eye movements. We then devise 90 different features based on these characteristics and select a subset of them using minimum redundancy maximum relevance (mRMR) feature selection. We validate the method using an eight participant study in an office environment using an example set of five activity classes: copying a text, reading a printed paper, taking handwritten notes, watching a video, and browsing the Web. We also include periods with no specific activity (the NULL class). Using a support vector machine (SVM) classifier and person-independent (leave-one-person-out) training, we obtain an average precision of 76.1 percent and recall of 70.5 percent over all classes and participants. The work demonstrates the promise of eye-based activity recognition (EAR) and opens up discussion on the wider applicability of EAR to other activities that are difficult, or even impossible, to detect using common sensing modalities.
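
The evaluation protocol maps directly onto scikit-learn. A minimal sketch that assumes the EOG feature extraction and mRMR selection have already produced the feature matrix X.

```python
# Minimal sketch of person-independent (leave-one-person-out) SVM evaluation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def lopo_accuracy(X, y, person_ids):
    """X: one feature row per window; y: activity labels (incl. NULL);
    person_ids: participant id of each row, used as the grouping variable."""
    clf = SVC(kernel='rbf', C=1.0)
    scores = cross_val_score(clf, X, y, groups=person_ids,
                             cv=LeaveOneGroupOut())
    return scores.mean()
```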

Journal ArticleDOI
TL;DR: This work investigates two representative ways of approximating the dense similarity matrix, picks the strategy of sparsifying the matrix via retaining nearest neighbors, and investigates its parallelization, which can effectively handle large problems.
Abstract: Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix. We compare one approach, which sparsifies the matrix, with another based on the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 images, we show that our parallel algorithm can effectively handle large problems.
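
The retained single-machine strategy is available off the shelf. A minimal sketch of kNN-sparsified spectral clustering; the paper's actual contribution, distributing memory and computation across machines, is not captured here.

```python
# Minimal sketch of spectral clustering on a kNN-sparsified similarity matrix.
from sklearn.cluster import SpectralClustering

def sparse_spectral(X, n_clusters=20, n_neighbors=10):
    model = SpectralClustering(n_clusters=n_clusters,
                               affinity='nearest_neighbors',  # sparsified W
                               n_neighbors=n_neighbors,
                               assign_labels='kmeans')
    return model.fit_predict(X)
```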

Journal ArticleDOI
TL;DR: A set of building blocks for constructing descriptors which can be combined together and jointly optimized so as to minimize the error of a nearest-neighbor classifier are described.
Abstract: In this paper, we explore methods for learning local image descriptors from training data. We describe a set of building blocks for constructing descriptors which can be combined together and jointly optimized so as to minimize the error of a nearest-neighbor classifier. We consider both linear and nonlinear transforms with dimensionality reduction, and make use of discriminant learning techniques such as Linear Discriminant Analysis (LDA) and Powell minimization to solve for the parameters. Using these techniques, we obtain descriptors that exceed state-of-the-art performance with low dimensionality. In addition to new experiments and recommendations for descriptor learning, we are also making available a new and realistic ground truth data set based on multiview stereo data.

Journal ArticleDOI
TL;DR: It is shown how one can create and label large data sets of real-world images to train classifiers which measure the presence, absence, or degree to which an attribute is expressed in images, which can then automatically label new images.
Abstract: We introduce the use of describable visual attributes for face verification and image search. Describable visual attributes are labels that can be given to an image to describe its appearance. This paper focuses on images of faces and the attributes used to describe them, although the concepts also apply to other domains. Examples of face attributes include gender, age, jaw shape, nose size, etc. The advantages of an attribute-based representation for vision tasks are manifold: They can be composed to create descriptions at various levels of specificity; they are generalizable, as they can be learned once and then applied to recognize new objects or categories without any further training; and they are efficient, possibly requiring exponentially fewer attributes (and training data) than explicitly naming each category. We show how one can create and label large data sets of real-world images to train classifiers which measure the presence, absence, or degree to which an attribute is expressed in images. These classifiers can then automatically label new images. We demonstrate the current effectiveness-and explore the future potential-of using attributes for face verification and image search via human and computational experiments. Finally, we introduce two new face data sets, named FaceTracer and PubFig, with labeled attributes and identities, respectively.

Journal ArticleDOI
TL;DR: This work proposes a novel method for 3D shape recovery of faces that exploits the similarity of faces, and obtains as input a single image and uses a mere single 3D reference model of a different person's face.
Abstract: Human faces are remarkably similar in global properties, including size, aspect ratio, and location of main features, but can vary considerably in details across individuals, gender, race, or due to facial expression. We propose a novel method for 3D shape recovery of faces that exploits the similarity of faces. Our method obtains as input a single image and uses a mere single 3D reference model of a different person's face. Classical reconstruction methods from single images, i.e., shape-from-shading, require knowledge of the reflectance properties and lighting as well as depth values for boundary conditions. Recent methods circumvent these requirements by representing input faces as combinations (of hundreds) of stored 3D models. We propose instead to use the input image as a guide to "mold" a single reference model to reach a reconstruction of the sought 3D shape. Our method assumes Lambertian reflectance and uses harmonic representations of lighting. It has been tested on images taken under controlled viewing conditions as well as on uncontrolled images downloaded from the Internet, demonstrating its accuracy and robustness under a variety of imaging conditions and overcoming significant differences in shape between the input and reference individuals, including differences in facial expressions, gender, and race.

Journal ArticleDOI
TL;DR: This paper proposes a novel, nonparametric approach for object recognition and scene parsing using a new technology the authors name label transfer, which is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.
Abstract: While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this paper, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images. Then, the system establishes dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm [28], which aligns two images based on local image structures. Finally, based on the dense scene correspondences obtained from SIFT flow, our system warps the existing annotations and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on challenging databases. Compared to existing object recognition approaches that require training classifiers or appearance models for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.

Journal ArticleDOI
TL;DR: An action descriptor is developed that captures the structure of temporal similarities and dissimilarities within an action sequence and is shown to be stable under performance variations within a class of actions when individual speed fluctuations are ignored.
Abstract: This paper addresses recognition of human actions under view changes. We explore self-similarities of action sequences over time and observe the striking stability of such measures across views. Building upon this key observation, we develop an action descriptor that captures the structure of temporal similarities and dissimilarities within an action sequence. Despite this temporal self-similarity descriptor not being strictly view-invariant, we provide intuition and experimental validation demonstrating its high stability under view changes. Self-similarity descriptors are also shown to be stable under performance variations within a class of actions when individual speed fluctuations are ignored. If required, such fluctuations between two different instances of the same action class can be explicitly recovered with dynamic time warping, as will be demonstrated, to achieve cross-view action synchronization. More central to the current work, temporal ordering of local self-similarity descriptors can simply be ignored within a bag-of-features type of approach. Sufficient action discrimination is still retained in this way to build a view-independent action recognition system. Interestingly, self-similarities computed from different image features possess similar properties and can be used in a complementary fashion. Our method is simple and requires neither structure recovery nor multiview correspondence estimation. Instead, it relies on weak geometric properties and combines them with machine learning for efficient cross-view action recognition. The method is validated on three public data sets. It has similar or superior performance compared to related methods and it performs well even in extreme conditions, such as when recognizing actions from top views while using side views only for training.
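
The descriptor is built on a temporal self-similarity matrix, which is trivial to compute. A sketch assuming some per-frame feature vectors; the paper then extracts local descriptors from patches of this matrix.

```python
# Minimal sketch of the temporal self-similarity matrix underlying the
# descriptor (illustrative; the local log-polar patch descriptors are omitted).
import numpy as np
from scipy.spatial.distance import cdist

def self_similarity_matrix(frame_features):
    """frame_features: T x d array (e.g., per-frame HOG or joint positions).
    The T x T pattern of pairwise distances is the quantity observed to be
    strikingly stable across viewpoints."""
    return cdist(frame_features, frame_features, metric='euclidean')
```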

Journal ArticleDOI
TL;DR: The previously reported failure of the naive MAP approach is explained by demonstrating that it mostly favors no-blur explanations, and it is shown that, using reasonable image priors, a naive simultaneous MAP estimation of both latent image and blur kernel is guaranteed to fail even with infinitely large images sampled from the prior.
Abstract: Blind deconvolution is the recovery of a sharp version of a blurred image when the blur kernel is unknown. Recent algorithms have afforded dramatic progress, yet many aspects of the problem remain challenging and hard to understand. The goal of this paper is to analyze and evaluate recent blind deconvolution algorithms both theoretically and experimentally. We explain the previously reported failure of the naive MAP approach by demonstrating that it mostly favors no-blur explanations. We show that, using reasonable image priors, a naive simultaneous MAP estimation of both latent image and blur kernel is guaranteed to fail even with infinitely large images sampled from the prior. On the other hand, we show that since the kernel size is often smaller than the image size, a MAP estimation of the kernel alone is well constrained and is guaranteed to succeed in recovering the true blur. The plethora of recent deconvolution techniques makes an experimental evaluation on ground-truth data important. As a first step toward this experimental evaluation, we have collected blur data with ground truth and compared recent algorithms under equal settings. Additionally, our data demonstrate that the shift-invariant blur assumption made by most algorithms is often violated.

Journal ArticleDOI
TL;DR: This paper discusses how commonly used parametric models for videos and image sets can be described using the unified framework of Grassmann and Stiefel manifolds, and derives statistical models of inter- and intraclass variations that respect the geometry of the space.
Abstract: In this paper, we examine image and video-based recognition applications where the underlying models have a special structure: the linear subspace structure. We discuss how commonly used parametric models for videos and image sets can be described using the unified framework of Grassmann and Stiefel manifolds. We first show that the parameters of linear dynamic models are finite-dimensional linear subspaces of appropriate dimensions. Unordered image sets as samples from a finite-dimensional linear subspace naturally fall under this framework. We show that an inference over subspaces can be naturally cast as an inference problem on the Grassmann manifold. To perform recognition using subspace-based models, we need tools from the Riemannian geometry of the Grassmann manifold. This involves a study of the geometric properties of the space, appropriate definitions of Riemannian metrics, and definitions of geodesics. Further, we derive statistical models of inter- and intraclass variations that respect the geometry of the space. We apply techniques such as intrinsic and extrinsic statistics to enable maximum-likelihood classification. We also provide algorithms for unsupervised clustering derived from the geometry of the manifold. Finally, we demonstrate the improved performance of these methods in a wide variety of vision applications such as activity recognition, video-based face recognition, object recognition from image sets, and activity-based video clustering.
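
Distances between subspaces reduce to principal angles. A minimal sketch of the geodesic (arc-length) distance on the Grassmann manifold; the statistical modeling and clustering built on top are omitted.

```python
# Minimal sketch of comparing two linear subspaces on the Grassmann manifold.
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_geodesic_dist(A, B):
    """A, B: n x k matrices whose columns span the two subspaces
    (e.g., observability subspaces of two linear dynamic models)."""
    theta = subspace_angles(A, B)       # principal angles in [0, pi/2]
    return np.sqrt((theta ** 2).sum())  # arc-length (geodesic) distance
```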

Journal ArticleDOI
TL;DR: The proposed approach to establishing correspondences between two sets of visual features using higher order constraints instead of the unary or pairwise ones used in classical methods is compared to state-of-the-art algorithms on both synthetic and real data.
Abstract: This paper addresses the problem of establishing correspondences between two sets of visual features using higher order constraints instead of the unary or pairwise ones used in classical methods. Concretely, the corresponding hypergraph matching problem is formulated as the maximization of a multilinear objective function over all permutations of the features. This function is defined by a tensor representing the affinity between feature tuples. It is maximized using a generalization of spectral techniques where a relaxed problem is first solved by a multidimensional power method and the solution is then projected onto the closest assignment matrix. The proposed approach has been implemented, and it is compared to state-of-the-art algorithms on both synthetic and real data.
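
The relaxation step can be sketched as a power iteration on a sparse third-order affinity tensor followed by projection onto an assignment. Illustrative: the tuple sampling and the affinity definition are problem-specific, and the triple list is assumed to be already symmetrized.

```python
# Minimal sketch of hypergraph matching by tensor power iteration plus
# projection onto the closest assignment (illustrative).
import numpy as np
from scipy.optimize import linear_sum_assignment

def tensor_power_match(triples, n1, n2, iters=50):
    """triples: list of (a, b, c, affinity) over candidate-assignment indices
    in [0, n1*n2); assignment i pairs feature i // n2 with feature i % n2."""
    n_assign = n1 * n2
    v = np.full(n_assign, 1.0 / np.sqrt(n_assign))
    for _ in range(iters):
        new = np.zeros(n_assign)
        for a, b, c, w in triples:            # (A v v)_a = sum w * v_b * v_c
            new[a] += w * v[b] * v[c]
        v = new / (np.linalg.norm(new) + 1e-12)
    # Project the relaxed solution onto a one-to-one assignment.
    score = v.reshape(n1, n2)
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows, cols))
```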

Journal ArticleDOI
TL;DR: A multimodal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web in order to automatically generate a large number of images for a specified object class.
Abstract: The objective of this work is to automatically generate a large number of images for a specified object class. A multimodal approach employing text, metadata, and visual features is used to gather many high-quality images from the Web. Candidate images are obtained by a text-based Web search querying on the object identifier (e.g., the word penguin). The Web pages and the images they contain are downloaded. The task is then to remove irrelevant images and rerank the remainder. First, the images are reranked based on the text surrounding the image and metadata features. A number of methods are compared for this reranking. Second, the top-ranked images are used as (noisy) training data and an SVM visual classifier is learned to improve the ranking further. We investigate the sensitivity of the cross-validation procedure to this noisy training data. The principal novelty of the overall method is in combining text/metadata and visual features in order to achieve a completely automatic ranking of the images. Examples are given for a selection of animals, vehicles, and other classes, totaling 18 classes. The results are assessed by precision/recall curves on ground-truth annotated data and by comparison to previous approaches, including those of Berg and Forsyth [5] and Fergus et al. [12].

Journal ArticleDOI
TL;DR: It is shown that by appropriately choosing which subproblems to use, one can design novel and very powerful MRF optimization algorithms that generalize and extend state-of-the-art message-passing methods and take full advantage of the special structure that may exist in particular MRFs.
Abstract: This paper introduces a new rigorous theoretical framework to address discrete MRF-based optimization in computer vision. Such a framework exploits the powerful technique of Dual Decomposition. It is based on a projected subgradient scheme that attempts to solve an MRF optimization problem by first decomposing it into a set of appropriately chosen subproblems, and then combining their solutions in a principled way. In order to determine the limits of this method, we analyze the conditions that these subproblems have to satisfy and demonstrate the extreme generality and flexibility of such an approach. We thus show that by appropriately choosing what subproblems to use, one can design novel and very powerful MRF optimization algorithms. For instance, in this manner we are able to derive algorithms that: 1) generalize and extend state-of-the-art message-passing methods, 2) optimize very tight LP-relaxations to MRF optimization, and 3) take full advantage of the special structure that may exist in particular MRFs, allowing the use of efficient inference techniques such as, e.g., graph-cut-based methods. Theoretical analysis on the bounds related with the different algorithms derived from our framework and experimental results/comparisons using synthetic and real data for a variety of tasks in computer vision demonstrate the extreme potentials of our approach.

Journal ArticleDOI
TL;DR: A face-image pair-matching approach, primarily developed and tested on the “Labeled Faces in the Wild” benchmark that reflects the challenges of face recognition from unconstrained images, is presented, along with a number of novel, effective similarity measures.
Abstract: Computer vision systems have demonstrated considerable improvement in recognizing and verifying faces in digital images. Still, recognizing faces appearing in unconstrained, natural conditions remains a challenging task. In this paper, we present a face-image, pair-matching approach primarily developed and tested on the “Labeled Faces in the Wild” (LFW) benchmark that reflects the challenges of face recognition from unconstrained images. The approach we propose makes the following contributions. 1) We present a family of novel face-image descriptors designed to capture statistics of local patch similarities. 2) We demonstrate how unlabeled background samples may be used to better evaluate image similarities. To this end, we describe a number of novel, effective similarity measures. 3) We show how labeled background samples, when available, may further improve classification performance, by employing a unique pair-matching pipeline. We present state-of-the-art results on the LFW pair-matching benchmarks. In addition, we show our system to be well suited for multilabel face classification (recognition) problem, on both the LFW images and on images from the laboratory controlled multi-PIE database.