
Showing papers in "IEEE Transactions on Pattern Analysis and Machine Intelligence in 2008"


Journal ArticleDOI
TL;DR: This paper describes the Semi-Global Matching (SGM) stereo method, which uses a pixelwise, Mutual Information-based matching cost to compensate for radiometric differences of input images, and demonstrates tolerance against a wide range of radiometric transformations.
Abstract: This paper describes the semiglobal matching (SGM) stereo method. It uses a pixelwise, mutual information (MI)-based matching cost to compensate for radiometric differences of input images. Pixelwise matching is supported by a smoothness constraint that is usually expressed as a global cost function. SGM performs a fast approximation by pathwise optimizations from all directions. The discussion also addresses occlusion detection, subpixel refinement, and multibaseline matching. Additionally, postprocessing steps for removing outliers, recovering from specific problems of structured environments, and the interpolation of gaps are presented. Finally, strategies for processing almost arbitrarily large images and fusion of disparity images using orthographic projection are proposed. A comparison on standard stereo images shows that SGM is among the currently top-ranked algorithms and performs best when subpixel accuracy is considered. The complexity is linear in the number of pixels and the disparity range, which results in a runtime of just 1-2 seconds on typical test images. An in-depth evaluation of the MI-based matching cost demonstrates tolerance against a wide range of radiometric transformations. Finally, examples of reconstructions from huge aerial frame and pushbroom images demonstrate that the presented ideas work well on practical problems.
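As a rough illustration of the pathwise optimization, here is a minimal NumPy sketch of the aggregation along a single (left-to-right) path, assuming a precomputed cost volume; the function name and the penalty defaults P1, P2 are ours, not the paper's:

```python
import numpy as np

def aggregate_left_to_right(cost, P1=10.0, P2=120.0):
    """Aggregate an (H, W, D) matching-cost volume along one path direction,
    following the SGM recursion
      L_r(p, d) = C(p, d) + min(L_r(p-r, d),
                                L_r(p-r, d-1) + P1,
                                L_r(p-r, d+1) + P1,
                                min_k L_r(p-r, k) + P2)
                          - min_k L_r(p-r, k)."""
    H, W, D = cost.shape
    L = np.empty_like(cost, dtype=np.float64)
    L[:, 0] = cost[:, 0]
    for x in range(1, W):
        prev = L[:, x - 1]                                 # (H, D)
        min_prev = prev.min(axis=1, keepdims=True)         # (H, 1)
        up = np.roll(prev, -1, axis=1); up[:, -1] = np.inf  # L_r(p-r, d+1)
        dn = np.roll(prev, 1, axis=1);  dn[:, 0] = np.inf   # L_r(p-r, d-1)
        trans = np.minimum(np.minimum(prev, up + P1),
                           np.minimum(dn + P1, min_prev + P2))
        L[:, x] = cost[:, x] + trans - min_prev
    return L
```

A full SGM run sums such aggregations over 8 or 16 path directions and takes the per-pixel disparity with minimal total cost.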

3,302 citations


Journal ArticleDOI
TL;DR: For certain classes that are particularly prevalent in the dataset, such as people, this work is able to demonstrate a recognition performance comparable to class-specific Viola-Jones style detectors.
Abstract: With the advent of the Internet, billions of images are now freely available online and constitute a dense sampling of the visual world. Using a variety of non-parametric methods, we explore this world with the aid of a large dataset of 79,302,017 images collected from the Internet. Motivated by psychophysical results showing the remarkable tolerance of the human visual system to degradations in image resolution, the images in the dataset are stored as 32 x 32 color images. Each image is loosely labeled with one of the 75,062 non-abstract nouns in English, as listed in the WordNet lexical database. Hence the image database gives comprehensive coverage of all object categories and scenes. The semantic information from WordNet can be used in conjunction with nearest-neighbor methods to perform object classification over a range of semantic levels, minimizing the effects of labeling noise. For certain classes that are particularly prevalent in the dataset, such as people, we are able to demonstrate a recognition performance comparable to class-specific Viola-Jones style detectors.
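A hedged sketch of the core non-parametric step, assuming images stored as normalized arrays (function and variable names are ours; the paper additionally uses shift- and warp-tolerant SSD variants):

```python
import numpy as np

def nearest_labels(query, dataset, labels, k=5):
    """Nearest neighbors of a 32x32 color image by sum-of-squared differences.

    query:   (32, 32, 3) array; dataset: (N, 32, 32, 3) array;
    labels:  length-N list of WordNet noun labels."""
    ssd = ((dataset - query) ** 2).sum(axis=(1, 2, 3))
    return [labels[i] for i in np.argsort(ssd)[:k]]
```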

1,871 citations


Journal ArticleDOI
TL;DR: A closed-form solution to natural image matting that allows us to find the globally optimal alpha matte by solving a sparse linear system of equations and predicts the properties of the solution by analyzing the eigenvectors of a sparse matrix, closely related to matrices used in spectral image segmentation algorithms.
Abstract: Interactive digital matting, the process of extracting a foreground object from an image based on limited user input, is an important task in image and video editing. From a computer vision perspective, this task is extremely challenging because it is massively ill-posed - at each pixel we must estimate the foreground and the background colors, as well as the foreground opacity ("alpha matte") from a single color measurement. Current approaches either restrict the estimation to a small part of the image, estimating foreground and background colors based on nearby pixels where they are known, or perform iterative nonlinear estimation by alternating foreground and background color estimation with alpha estimation. In this paper, we present a closed-form solution to natural image matting. We derive a cost function from local smoothness assumptions on foreground and background colors and show that in the resulting expression, it is possible to analytically eliminate the foreground and background colors to obtain a quadratic cost function in alpha. This allows us to find the globally optimal alpha matte by solving a sparse linear system of equations. Furthermore, the closed-form formula allows us to predict the properties of the solution by analyzing the eigenvectors of a sparse matrix, closely related to matrices used in spectral image segmentation algorithms. We show that high-quality mattes for natural images may be obtained from a small amount of user input.
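Given the matting Laplacian L (whose construction from local color statistics is the technical heart of the paper and is omitted here), the constrained solve is a single sparse linear system. A minimal sketch, with our own function name and a typical constraint weight:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def solve_alpha(L, scribble_mask, scribble_vals, lam=100.0):
    """Solve for the globally optimal alpha matte given the matting Laplacian.

    L:             (N, N) sparse matting Laplacian over N image pixels.
    scribble_mask: (N,) boolean array, True where the user constrained alpha.
    scribble_vals: (N,) array, 1 for foreground scribbles, 0 for background."""
    D = sp.diags(scribble_mask.astype(float))        # penalize deviation at scribbles
    b = lam * scribble_mask * scribble_vals
    alpha = spsolve((L + lam * D).tocsr(), b)        # sparse linear system
    return np.clip(alpha, 0.0, 1.0)
```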

1,851 citations


Journal ArticleDOI
TL;DR: A set of energy minimization benchmarks is described and used to compare the solution quality and runtime of several common energy minimization algorithms, and a general-purpose software interface is provided that allows vision researchers to easily switch between optimization methods.
Abstract: Among the most exciting advances in early vision has been the development of efficient energy minimization algorithms for pixel-labeling tasks such as depth or texture computation. It has been known for decades that such problems can be elegantly expressed as Markov random fields, yet the resulting energy minimization problems have been widely viewed as intractable. Algorithms such as graph cuts and loopy belief propagation (LBP) have proven to be very powerful: For example, such methods form the basis for almost all the top-performing stereo methods. However, the trade-offs among different energy minimization algorithms are still not well understood. In this paper, we describe a set of energy minimization benchmarks and use them to compare the solution quality and runtime of several common energy minimization algorithms. We investigate three promising methods-graph cuts, LBP, and tree-reweighted message passing-in addition to the well-known older iterated conditional mode (ICM) algorithm. Our benchmark problems are drawn from published energy functions used for stereo, image stitching, interactive segmentation, and denoising. We also provide a general-purpose software interface that allows vision researchers to easily switch between optimization methods. The benchmarks, code, images, and results are available at http://vision.middlebury.edu/MRF/.
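For flavor, a minimal sketch of the simplest benchmarked method, ICM, on a 4-connected grid with a Potts smoothness term (array shapes and the smoothness weight are our assumptions; the benchmark's actual energies are the published ones for stereo, stitching, segmentation, and denoising):

```python
import numpy as np

def icm_potts(unary, smooth=1.0, n_iters=5):
    """Iterated conditional modes on a 4-connected grid with a Potts prior.

    unary: (H, W, K) data costs for K labels; returns an (H, W) labeling."""
    H, W, K = unary.shape
    lab = unary.argmin(axis=2)                     # independent per-pixel init
    for _ in range(n_iters):
        for y in range(H):
            for x in range(W):
                cost = unary[y, x].astype(float)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts penalty for disagreeing with a neighbor
                        cost += smooth * (np.arange(K) != lab[ny, nx])
                lab[y, x] = cost.argmin()          # greedy local update
    return lab
```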

1,065 citations


Journal ArticleDOI
TL;DR: A novel approach for classifying points lying on a connected Riemannian manifold using the geometry of the space of d-dimensional nonsingular covariance matrices as object descriptors.
Abstract: We present a new algorithm to detect pedestrians in still images utilizing covariance matrices as object descriptors. Since the descriptors do not form a vector space, well-known machine learning techniques are not well suited to learn the classifiers. The space of d-dimensional nonsingular covariance matrices can be represented as a connected Riemannian manifold. The main contribution of the paper is a novel approach for classifying points lying on a connected Riemannian manifold using the geometry of the space. The algorithm is tested on the INRIA and DaimlerChrysler pedestrian datasets, where superior detection rates are observed over previous approaches.
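The geometry in question admits a closed-form geodesic distance between two symmetric positive-definite descriptors, which a sketch makes concrete (we show the standard affine-invariant metric; the paper's full classifier additionally maps points to tangent spaces and boosts weak learners there):

```python
import numpy as np
from scipy.linalg import eigvalsh

def geodesic_distance(X, Y):
    """Affine-invariant distance between SPD covariance descriptors:
    d(X, Y) = sqrt(sum_i ln^2 lambda_i), where lambda_i are the
    generalized eigenvalues of X v = lambda Y v."""
    lam = eigvalsh(X, Y)                 # generalized symmetric eigenproblem
    return np.sqrt((np.log(lam) ** 2).sum())
```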

1,044 citations


Journal ArticleDOI
TL;DR: A novel scheme of emotion-specific multilevel dichotomous classification (EMDC) is developed and compared with direct multiclass classification using the pLDA, with improved recognition accuracy of 95 percent and 70 percent for subject-dependent and subject-independent classification, respectively.
Abstract: Little attention has been paid so far to physiological signals for emotion recognition compared to audiovisual emotion channels such as facial expression or speech. This paper investigates the potential of physiological signals as reliable channels for emotion recognition. All essential stages of an automatic recognition system are discussed, from the recording of a physiological data set to a feature-based multiclass classification. In order to collect a physiological data set from multiple subjects over many weeks, we used a musical induction method that spontaneously leads subjects to real emotional states, without any deliberate laboratory setting. Four-channel biosensors were used to measure electromyogram, electrocardiogram, skin conductivity, and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, multiscale entropy, etc., is proposed in order to find the best emotion-relevant features and to correlate them with emotional states. The best features extracted are specified in detail and their effectiveness is proven by classification results. Classification of four musical emotions (positive/high arousal, negative/high arousal, negative/low arousal, and positive/low arousal) is performed by using an extended linear discriminant analysis (pLDA). Furthermore, by exploiting a dichotomic property of the 2D emotion model, we develop a novel scheme of emotion-specific multilevel dichotomous classification (EMDC) and compare its performance with direct multiclass classification using the pLDA. An improved recognition accuracy of 95 percent and 70 percent for subject-dependent and subject-independent classification, respectively, is achieved by using the EMDC scheme.
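A hedged sketch of the dichotomous idea: split the four-class problem along the arousal axis first, then resolve valence with branch-specific classifiers. We substitute scikit-learn's plain LDA for the paper's pLDA and use our own class and variable names:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

class EMDCSketch:
    """Emotion-specific multilevel dichotomous classification, simplified."""
    def fit(self, X, arousal, valence):            # label arrays in {0, 1}
        self.f_arousal = LDA().fit(X, arousal)
        self.f_val_hi = LDA().fit(X[arousal == 1], valence[arousal == 1])
        self.f_val_lo = LDA().fit(X[arousal == 0], valence[arousal == 0])
        return self

    def predict(self, X):
        a = self.f_arousal.predict(X)              # stage 1: high vs. low arousal
        v = np.where(a == 1, self.f_val_hi.predict(X), self.f_val_lo.predict(X))
        return np.stack([a, v], axis=1)            # (arousal, valence) per sample
```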

953 citations


Journal ArticleDOI
TL;DR: It is demonstrated that the generative model can effectively handle occlusions in each time frame independently, even when the only data available comes from the output of a simple background subtraction algorithm and when the number of individuals is unknown a priori.
Abstract: Given two to four synchronized video streams taken at eye level and from different angles, we show that we can effectively combine a generative model with dynamic programming to accurately follow up to six individuals across thousands of frames in spite of significant occlusions and lighting changes. In addition, we also derive metrically accurate trajectories for each of them. Our contribution is twofold. First, we demonstrate that our generative model can effectively handle occlusions in each time frame independently, even when the only data available comes from the output of a simple background subtraction algorithm and when the number of individuals is unknown a priori. Second, we show that multiperson tracking can be reliably achieved by processing individual trajectories separately over long sequences, provided that a reasonable heuristic is used to rank these individuals and that we avoid confusing them with one another.

865 citations


Journal ArticleDOI
TL;DR: A novel algorithm for detecting certain types of unusual events, based on multiple local monitors that collect low-level statistics; it is robust and works well in crowded scenes where tracking-based algorithms are likely to fail.
Abstract: We present a novel algorithm for detection of certain types of unusual events. The algorithm is based on multiple local monitors which collect low-level statistics. Each local monitor produces an alert if its current measurement is unusual and these alerts are integrated to a final decision regarding the existence of an unusual event. Our algorithm satisfies a set of requirements that are critical for successful deployment of any large-scale surveillance system. In particular, it requires a minimal setup (taking only a few minutes) and is fully automatic afterwards. Since it is not based on objects' tracks, it is robust and works well in crowded scenes where tracking-based algorithms are likely to fail. The algorithm is effective as soon as sufficient low-level observations representing the routine activity have been collected, which usually happens after a few minutes. Our algorithm runs in real-time. It was tested on a variety of real-life crowded scenes. A ground-truth was extracted for these scenes, with respect to which detection and false-alarm rates are reported.
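A loose sketch of the architecture (the statistics collected and the integration rule here are our simplifications, not the paper's exact ones): each monitor learns an empirical distribution of its local measurement during routine activity and alerts on low-probability values, and the alerts are then integrated:

```python
import numpy as np

class LocalMonitor:
    """Collects low-level measurements for one image region and flags rare ones."""
    def __init__(self, bins=20, alert_prob=0.01):
        self.history, self.bins, self.alert_prob = [], bins, alert_prob

    def observe(self, value):
        self.history.append(value)                 # routine-activity training phase

    def is_unusual(self, value):
        hist, edges = np.histogram(self.history, bins=self.bins)
        p = hist / hist.sum()                      # empirical probability per bin
        i = np.clip(np.searchsorted(edges, value) - 1, 0, self.bins - 1)
        return p[i] < self.alert_prob

def unusual_event(monitors, measurements, min_alerts=5):
    """Final decision: enough local monitors must alert simultaneously."""
    return sum(m.is_unusual(v) for m, v in zip(monitors, measurements)) >= min_alerts
```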

822 citations



Journal ArticleDOI
TL;DR: This work introduces a novel vocabulary using dense color SIFT descriptors and investigates the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM).
Abstract: We investigate whether dimensionality reduction using a latent generative model is beneficial for the task of weakly supervised scene classification. In detail, we are given a set of labeled images of scenes (for example, coast, forest, city, river, etc.), and our objective is to classify a new image into one of these categories. Our approach consists of first discovering latent "topics" using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature here applied to a bag of visual words representation for each image, and subsequently, training a multiway classifier on the topic distribution vector for each image. We compare this approach to that of representing each image by a bag of visual words vector directly and training a multiway classifier on these vectors. To this end, we introduce a novel vocabulary using dense color SIFT descriptors and then investigate the classification performance under changes in the size of the visual vocabulary, the number of latent topics learned, and the type of discriminative classifier used (k-nearest neighbor or SVM). We achieve superior classification performance to recent publications that have used a bag of visual words representation, in all cases, using the authors' own data sets and testing protocols. We also investigate the gain in adding spatial information. We show applications to image retrieval with relevance feedback and to scene classification in videos.
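For concreteness, a compact EM implementation of pLSA over a visual-word count matrix (a dense, memory-naive sketch with our own names; it assumes every image contains at least one visual word). The resulting per-image topic vectors P(z|d) are what the multiway classifier is trained on:

```python
import numpy as np

def plsa(counts, n_topics, n_iters=100, seed=0):
    """EM for pLSA on a (W, D) visual-word-by-image count matrix.

    Returns P(w|z) of shape (W, Z) and P(z|d) of shape (Z, D)."""
    rng = np.random.default_rng(seed)
    W, D = counts.shape
    p_w_z = rng.random((W, n_topics)); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((n_topics, D)); p_z_d /= p_z_d.sum(axis=0)
    for _ in range(n_iters):
        # E-step: P(z|w,d) proportional to P(w|z) P(z|d)
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]        # (W, Z, D)
        p_z_wd = joint / joint.sum(axis=1, keepdims=True)
        # M-step: reweight by observed counts
        nz = counts[:, None, :] * p_z_wd                     # (W, Z, D)
        p_w_z = nz.sum(axis=2); p_w_z /= p_w_z.sum(axis=0, keepdims=True)
        p_z_d = nz.sum(axis=0); p_z_d /= p_z_d.sum(axis=0, keepdims=True)
    return p_w_z, p_z_d
```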

778 citations


Journal ArticleDOI
Nojun Kwak1
TL;DR: A method of principal component analysis (PCA) based on a new L1-norm optimization technique which is robust to outliers and invariant to rotations and also proven to find a locally maximal solution.
Abstract: A method of principal component analysis (PCA) based on a new L1-norm optimization technique is proposed. Unlike conventional PCA which is based on L2-norm, the proposed method is robust to outliers because it utilizes L1-norm which is less sensitive to outliers. It is invariant to rotations as well. The proposed L1-norm optimization technique is intuitive, simple, and easy to implement. It is also proven to find a locally maximal solution. The proposed method is applied to several datasets and the performances are compared with those of other conventional methods.
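The optimization is simple enough to sketch in a few lines; this follows the fixed-point iteration described in the paper (our function name; further components can be obtained by deflating, X <- X - (X w) w^T, and re-running):

```python
import numpy as np

def pca_l1(X, n_iters=100):
    """First projection vector of PCA-L1: maximize sum_i |w^T x_i|, ||w|| = 1.

    X: (n_samples, n_features), assumed zero-mean."""
    w = X[np.argmax((X ** 2).sum(axis=1))].astype(float)  # largest sample as init
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        s = np.sign(X @ w)
        s[s == 0] = 1.0                 # avoid stalling on sign(0)
        w_new = X.T @ s
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):       # converged to a locally maximal solution
            break
        w = w_new
    return w
```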

Journal ArticleDOI
TL;DR: New optimization and estimation techniques to address two fundamental problems in machine learning are developed, which serve as the basis for the Automatic Linguistic Indexing of Pictures - Real Time (ALIPR) system of fully automatic and high speed annotation for online pictures.
Abstract: Developing effective methods for automated annotation of digital pictures continues to challenge computer scientists. The capability of annotating pictures by computers can lead to breakthroughs in a wide range of applications, including Web image search, online picture-sharing communities, and scientific experiments. In this work, the authors developed new optimization and estimation techniques to address two fundamental problems in machine learning. These new techniques serve as the basis for the automatic linguistic indexing of pictures - real time (ALIPR) system of fully automatic and high-speed annotation for online pictures. In particular, the D2-clustering method, in the same spirit as K-Means for vectors, is developed to group objects represented by bags of weighted vectors. Moreover, a generalized mixture modeling technique (kernel smoothing as a special case) for nonvector data is developed using the novel concept of hypothetical local mapping (HLM). ALIPR has been tested by thousands of pictures from an Internet photo-sharing site, unrelated to the source of those pictures used in the training process. Its performance has also been studied at an online demonstration site, where arbitrary users provide pictures of their choices and indicate the correctness of each annotation word. The experimental results show that a single computer processor can suggest annotation terms in real time and with good accuracy.

Journal ArticleDOI
TL;DR: It is shown that kAS substantially outperform IPs for detecting shape-based classes, and the object detector is compared to the recent state-of-the-art system by Dalal and Triggs (2005).
Abstract: We present a family of scale-invariant local shape features formed by chains of k connected roughly straight contour segments (kAS), and their use for object class detection. kAS are able to cleanly encode pure fragments of an object boundary without including nearby clutter. Moreover, they offer an attractive compromise between information content and repeatability and encompass a wide variety of local shape structures. We also define a translation and scale invariant descriptor encoding the geometric configuration of the segments within a kAS, making kAS easy to reuse in other frameworks, for example, as a replacement or addition to interest points (IPs). Software for detecting and describing kAS is released at http://lear.inrialpes.fr/software. We demonstrate the high performance of kAS within a simple but powerful sliding-window object detection scheme. Through extensive evaluations, involving eight diverse object classes and more than 1,400 images, we (1) study the evolution of performance as the degree of feature complexity k varies and determine the best degree, (2) show that kAS substantially outperform IPs for detecting shape-based classes, and (3) compare our object detector to the recent state-of-the-art system by Dalal and Triggs (2005).

Journal ArticleDOI
TL;DR: This work proposes the use of a modified version of the correlation coefficient as a performance criterion for the image alignment problem and proposes an efficient approximation that leads to a closed form solution which is of low computational complexity.
Abstract: In this work we propose the use of a modified version of the correlation coefficient as a performance criterion for the image alignment problem. The proposed modification has the desirable characteristic of being invariant with respect to photometric distortions. Since the resulting similarity measure is a nonlinear function of the warp parameters, we develop two iterative schemes for its maximization, one based on the forward additive approach and the second on the inverse compositional method. As it is customary in iterative optimization, in each iteration the nonlinear objective function is approximated by an alternative expression for which the corresponding optimization is simple. In our case we propose an efficient approximation that leads to a closed form solution (per iteration) which is of low computational complexity, the latter property being particularly strong in our inverse version. The proposed schemes are tested against the forward additive Lucas-Kanade and the simultaneous inverse compositional algorithm through simulations. Under noisy conditions and photometric distortions our forward version achieves more accurate alignments and exhibits faster convergence whereas our inverse version has similar performance as the simultaneous inverse compositional algorithm but at a lower computational complexity.
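OpenCV's findTransformECC implements this enhanced-correlation-coefficient criterion, so the method can be tried directly. A usage sketch with hypothetical filenames (the optional mask/filter arguments vary slightly across OpenCV versions):

```python
import cv2
import numpy as np

template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
image = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

warp = np.eye(2, 3, dtype=np.float32)             # identity initialization (affine)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
cc, warp = cv2.findTransformECC(template, image, warp, cv2.MOTION_AFFINE, criteria)

# cc is the final correlation coefficient; warp aligns image to template.
aligned = cv2.warpAffine(image, warp, template.shape[::-1],
                         flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
```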

Journal ArticleDOI
TL;DR: Experiments on three multibiometric databases indicate that the proposed fusion framework achieves consistently high performance compared to commonly used score fusion techniques based on score transformation and classification.
Abstract: Multibiometric systems fuse information from different sources to compensate for the limitations in performance of individual matchers. We propose a framework for the optimal combination of match scores that is based on the likelihood ratio test. The distributions of genuine and impostor match scores are modeled as a finite Gaussian mixture model. The proposed fusion approach is general in its ability to handle 1) discrete values in biometric match score distributions, 2) arbitrary scales and distributions of match scores, 3) correlation between the scores of multiple matchers, and 4) sample quality of multiple biometric sources. Experiments on three multibiometric databases indicate that the proposed fusion framework achieves consistently high performance compared to commonly used score fusion techniques based on score transformation and classification.
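A minimal sketch of the likelihood-ratio rule using scikit-learn's Gaussian mixtures (the component count and function names are our choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_likelihood_ratio(genuine_scores, impostor_scores, n_components=3):
    """genuine_scores, impostor_scores: (N, n_matchers) arrays of match scores.

    Returns a function giving log LR(s) = log p_gen(s) - log p_imp(s);
    accept an identity claim when it exceeds a chosen threshold."""
    gen = GaussianMixture(n_components).fit(genuine_scores)
    imp = GaussianMixture(n_components).fit(impostor_scores)
    return lambda s: gen.score_samples(s) - imp.score_samples(s)
```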

Journal ArticleDOI
TL;DR: A unified framework for automatic estimation and removal of color noise from a single image using piecewise smooth image models is proposed and an upper bound of the real NLF is estimated by fitting a lower envelope to the standard deviations of per-segment image variances.
Abstract: Image denoising algorithms often assume an additive white Gaussian noise (AWGN) process that is independent of the actual RGB values. Such approaches cannot effectively remove color noise produced by today's CCD digital cameras. In this paper, we propose a unified framework for two tasks: automatic estimation and removal of color noise from a single image using piecewise smooth image models. We introduce the noise level function (NLF), which is a continuous function describing the noise level as a function of image brightness. We then estimate an upper bound of the real NLF by fitting a lower envelope to the standard deviations of per-segment image variances. For denoising, the chrominance of color noise is significantly removed by projecting pixel values onto a line fit to the RGB values in each segment. Then, a Gaussian conditional random field (GCRF) is constructed to obtain the underlying clean image from the noisy input. Extensive experiments are conducted to test the proposed algorithm, which is shown to outperform state-of-the-art denoising algorithms.
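The per-segment chrominance step has a simple linear-algebra core; a sketch of projecting one segment's colors onto their dominant line (segmentation, NLF estimation, and the GCRF inference are omitted; names ours):

```python
import numpy as np

def project_segment_colors(rgb):
    """Project one segment's colors onto the dominant line fit to its
    RGB values, suppressing off-line (chrominance) noise.

    rgb: (n_pixels, 3) array of the segment's colors."""
    mean = rgb.mean(axis=0)
    centered = rgb - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    line = vt[0]                                   # first principal direction
    return mean + np.outer(centered @ line, line)  # points snapped to the line
```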

Journal ArticleDOI
TL;DR: A reconstruction method using a Probabilistic Principal Components Analysis shape model and an estimation algorithm that simultaneously estimates 3D shape and motion for each instant, learns the PPCA model parameters, and robustly fills-in missing data points is proposed.
Abstract: This paper describes methods for recovering time-varying shape and motion of nonrigid 3D objects from uncalibrated 2D point tracks. For example, given a video recording of a talking person, we would like to estimate the 3D shape of the face at each instant and learn a model of facial deformation. Time-varying shape is modeled as a rigid transformation combined with a nonrigid deformation. Reconstruction is ill-posed if arbitrary deformations are allowed, and thus additional assumptions about deformations are required. We first suggest restricting shapes to lie within a low-dimensional subspace and describe estimation algorithms. However, this restriction alone is insufficient to constrain reconstruction. To address these problems, we propose a reconstruction method using a Probabilistic Principal Components Analysis (PPCA) shape model and an estimation algorithm that simultaneously estimates 3D shape and motion for each instant, learns the PPCA model parameters, and robustly fills-in missing data points. We then extend the model to represent temporal dynamics in object shape, allowing the algorithm to robustly handle severe cases of missing data.

Journal ArticleDOI
TL;DR: This work cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and proposes VisualRank to analyze the visual link structures among images and describes the techniques required to make this system practical for large-scale deployment in commercial search engines.
Abstract: Because of the relative ease in understanding and processing text, commercial image-search systems often rely on techniques that are largely indistinguishable from text search. Recently, academic studies have demonstrated the effectiveness of employing image-based features to provide either alternative or additional signals to use in this process. However, it remains uncertain whether such techniques will generalize to a large number of popular Web queries and whether the potential improvement to search quality warrants the additional computational cost. In this work, we cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and propose VisualRank to analyze the visual link structures among images. The images found to be "authorities" are chosen as those that answer the image queries well. To understand the performance of such an approach in a real system, we conducted a series of large-scale experiments based on the task of retrieving images for 2,000 of the most popular product queries. Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google image search results. Maintaining modest computational cost is vital to ensuring that this procedure can be used in practice; we describe the techniques required to make this system practical for large-scale deployment in commercial search engines.
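Structurally, VisualRank is a PageRank-style random walk over a visual-similarity graph; a sketch of the power iteration under that reading (the paper builds the similarity matrix from local-feature matches and biases the damping vector toward top search results):

```python
import numpy as np

def visual_rank(S, alpha=0.85, n_iters=100):
    """Power iteration for r = alpha * P r + (1 - alpha) * v.

    S: (N, N) nonnegative similarity matrix between images; every image
    is assumed similar to at least one other (nonzero columns)."""
    P = S / S.sum(axis=0, keepdims=True)   # column-stochastic transitions
    N = S.shape[0]
    r = np.full(N, 1.0 / N)
    v = np.full(N, 1.0 / N)                # uniform damping vector here
    for _ in range(n_iters):
        r = alpha * (P @ r) + (1 - alpha) * v
    return r                               # stationary "authority" scores
```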

Journal ArticleDOI
TL;DR: This work studies the mixture of dynamic textures, a statistical model for an ensemble of video sequences that is sampled from a finite collection of visual processes, each of which is a dynamic texture.
Abstract: A dynamic texture is a spatio-temporal generative model for video, which represents video sequences as observations from a linear dynamical system. This work studies the mixture of dynamic textures, a statistical model for an ensemble of video sequences that is sampled from a finite collection of visual processes, each of which is a dynamic texture. An expectation-maximization (EM) algorithm is derived for learning the parameters of the model, and the model is related to previous works in linear systems, machine learning, time-series clustering, control theory, and computer vision. Through experimentation, it is shown that the mixture of dynamic textures is a suitable representation for both the appearance and dynamics of a variety of visual processes that have traditionally been challenging for computer vision (for example, fire, steam, water, vehicle and pedestrian traffic, and so forth). When compared with state-of-the-art methods in motion segmentation, including both temporal texture methods and traditional representations (for example, optical flow or other localized motion representations), the mixture of dynamic textures achieves superior performance in the problems of clustering and segmenting video of such processes.

Journal ArticleDOI
Tong Lin1, Hongbin Zha1
TL;DR: A novel framework based on the assumption that the input high-dimensional data lie on an intrinsically low-dimensional Riemannian manifold, which can learn intrinsic geometric structures of the data, preserve radial geodesic distances, and yield regular embeddings.
Abstract: Recently, manifold learning has been widely exploited in pattern recognition, data analysis, and machine learning. This paper presents a novel framework, called Riemannian manifold learning (RML), based on the assumption that the input high-dimensional data lie on an intrinsically low-dimensional Riemannian manifold. The main idea is to formulate the dimensionality reduction problem as a classical problem in Riemannian geometry, that is, how to construct coordinate charts for a given Riemannian manifold? We implement the Riemannian normal coordinate chart, which has been the most widely used in Riemannian geometry, for a set of unorganized data points. First, two input parameters (the neighborhood size k and the intrinsic dimension d) are estimated based on an efficient simplicial reconstruction of the underlying manifold. Then, the normal coordinates are computed to map the input high-dimensional data into a low-dimensional space. Experiments on synthetic data, as well as real-world images, demonstrate that our algorithm can learn intrinsic geometric structures of the data, preserve radial geodesic distances, and yield regular embeddings.

Journal ArticleDOI
TL;DR: A randomized model verification strategy for RANSAC that removes the requirement for a priori knowledge of the fraction of outliers and estimates the quantity online, and has performance close to the theoretically optimal and is up to four times faster than previously published methods.
Abstract: A randomized model verification strategy for RANSAC is presented. The proposed method finds, like RANSAC, a solution that is optimal with user-specified probability. The solution is found in time that is close to the shortest possible and superior to any deterministic verification strategy. A provably fastest model verification strategy is designed for the (theoretical) situation when the contamination of data by outliers is known. In this case, the algorithm is the fastest possible (on the average) of all randomized RANSAC algorithms guaranteeing a confidence in the solution. The derivation of the optimality property is based on Wald's theory of sequential decision making, in particular, a modified sequential probability ratio test (SPRT). Next, the R-RANSAC with SPRT algorithm is introduced. The algorithm removes the requirement for a priori knowledge of the fraction of outliers and estimates the quantity online. We show experimentally that on standard test data, the method has performance close to the theoretically optimal and is 2 to 10 times faster than standard RANSAC and is up to four times faster than previously published methods.
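The verification test itself is short; a sketch of the per-model SPRT loop (parameter names follow the usual Wald setup: epsilon is the probability a point is consistent under a good model, delta under a bad one, A the decision threshold; model_consistent is a caller-supplied predicate):

```python
def sprt_verify(model_consistent, points, epsilon, delta, A):
    """Wald's sequential probability ratio test for RANSAC model verification.

    Returns False as soon as the likelihood ratio exceeds A (model rejected
    early as likely bad), True if all points are processed without rejection."""
    lam = 1.0
    for p in points:
        if model_consistent(p):
            lam *= delta / epsilon            # consistent: ratio shrinks (< 1)
        else:
            lam *= (1 - delta) / (1 - epsilon)  # inconsistent: ratio grows
        if lam > A:
            return False                      # early rejection
    return True
```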

Journal ArticleDOI
TL;DR: A model-based approach to interpret the image observations by multiple partially occluded human hypotheses in a Bayesian framework is proposed, which defines a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of the body obtained by occlusion reasoning, and foreground/background separation.
Abstract: Segmentation and tracking of multiple humans in crowded situations is made difficult by interobject occlusion. We propose a model-based approach to interpret the image observations by multiple partially occluded human hypotheses in a Bayesian framework. We define a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of the body obtained by occlusion reasoning, and foreground/background separation. The optimal solution is obtained by using an efficient sampling method, data-driven Markov chain Monte Carlo (DDMCMC), which uses image observations for proposal probabilities. Knowledge of various aspects, including human shape, camera model, and image cues, are integrated in one theoretically sound framework. We present experimental results and quantitative evaluation, demonstrating that the resulting approach is effective for very challenging data.

Journal ArticleDOI
TL;DR: A new automatic visual recognition system based only on local contour features, capable of localizing objects in space and scale, is proposed and compared with other methods based on contour and local descriptors in a detailed evaluation over 17 challenging categories.
Abstract: Psychophysical studies show that we can recognize objects using fragments of outline contour alone. This paper proposes a new automatic visual recognition system based only on local contour features, capable of localizing objects in space and scale. The system first builds a class-specific codebook of local fragments of contour using a novel formulation of chamfer matching. These local fragments allow recognition that is robust to within-class variation, pose changes, and articulation. Boosting combines these fragments into a cascaded sliding-window classifier, and mean shift is used to select strong responses as a final set of detections. We show how learning can be performed iteratively on both training and test sets to bootstrap an improved classifier. We compare with other methods based on contour and local descriptors in our detailed evaluation over 17 challenging categories and obtain highly competitive results. The results confirm that contour is indeed a powerful cue for multiscale and multiclass visual object recognition.
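The chamfer cost underlying the codebook matching can be computed with a distance transform. A plain (unoriented) sketch with our own names and integer template coordinates assumed; the paper's novel formulation additionally incorporates edge orientation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_cost(template_points, edge_map):
    """Mean distance from each template contour point to the nearest image edge.

    template_points: (K, 2) integer (row, col) coordinates of the fragment,
                     already placed at a candidate location and scale.
    edge_map:        (H, W) boolean edge image."""
    dt = distance_transform_edt(~edge_map)     # distance to nearest edge pixel
    return dt[template_points[:, 0], template_points[:, 1]].mean()  # lower = better
```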

Journal ArticleDOI
TL;DR: This communication shows that the basic ideas of UDP and LPP are identical, and UDP is just a simplified version of LPP on the assumption that the local density is uniform.
Abstract: In (Yang et al., 2007), UDP is proposed to address the limitation of LPP for the clustering and classification tasks. In this communication, we show that the basic ideas of UDP and LPP are identical. In particular, UDP is just a simplified version of LPP on the assumption that the local density is uniform.
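For reference, the two criteria can be written side by side in the usual notation (a hedged reconstruction from the standard formulations of LPP and UDP, not text quoted from this communication), where W_ij are the locality weights:

```latex
\text{LPP:}\; \min_{\mathbf{w}} \sum_{i,j} \bigl(\mathbf{w}^{\top}\mathbf{x}_i-\mathbf{w}^{\top}\mathbf{x}_j\bigr)^{2} W_{ij},
\qquad
\text{UDP:}\; \max_{\mathbf{w}} \frac{\sum_{i,j} \bigl(\mathbf{w}^{\top}\mathbf{x}_i-\mathbf{w}^{\top}\mathbf{x}_j\bigr)^{2}\,(1-W_{ij})}{\sum_{i,j} \bigl(\mathbf{w}^{\top}\mathbf{x}_i-\mathbf{w}^{\top}\mathbf{x}_j\bigr)^{2}\,W_{ij}}
```

The numerator of the UDP ratio is the nonlocal scatter and the denominator the same weighted local scatter that LPP minimizes, which is the sense in which UDP reduces to a simplified LPP when the local density is uniform.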

Journal ArticleDOI
TL;DR: A novel approach for multi-object tracking which considers object detection and spacetime trajectory estimation as a coupled optimization problem is presented, formulated in a minimum description length hypothesis selection framework, which allows the system to recover from mismatches and temporarily lost tracks.
Abstract: We present a novel approach for multi-object tracking which considers object detection and spacetime trajectory estimation as a coupled optimization problem. Our approach is formulated in a minimum description length hypothesis selection framework, which allows our system to recover from mismatches and temporarily lost tracks. Building upon a state-of-the-art object detector, it performs multiview/multicategory object recognition to detect cars and pedestrians in the input images. The 2D object detections are checked for their consistency with (automatically estimated) scene geometry and are converted to 3D observations which are accumulated in a world coordinate frame. A subsequent trajectory estimation module analyzes the resulting 3D observations to find physically plausible spacetime trajectories. Tracking is achieved by performing model selection after every frame. At each time instant, our approach searches for the globally optimal set of spacetime trajectories which provides the best explanation for the current image and for all evidence collected so far while satisfying the constraints that no two objects may occupy the same physical space nor explain the same image pixels at any point in time. Successful trajectory hypotheses are then fed back to guide object detection in future frames. The optimization procedure is kept efficient through incremental computation and conservative hypothesis pruning. We evaluate our approach on several challenging video sequences and demonstrate its performance on both a surveillance-type scenario and a scenario where the input videos are taken from inside a moving vehicle passing through crowded city areas.

Journal ArticleDOI
TL;DR: One of the findings was that the automatic face alignment methods did not increase the gender classification rates, but manual alignment increased classification rates a little, which suggests that automatic alignment would be useful when the alignment methods are further improved.
Abstract: We present a systematic study on gender classification with automatically detected and aligned faces. We experimented with 120 combinations of automatic face detection, face alignment, and gender classification. One of the findings was that the automatic face alignment methods did not increase the gender classification rates. However, manual alignment increased classification rates a little, which suggests that automatic alignment would be useful when the alignment methods are further improved. We also found that the gender classification methods performed almost equally well with different input image sizes. In any case, the best classification rate was achieved with a support vector machine. A neural network and Adaboost achieved almost as good classification rates as the support vector machine and could be used in applications where classification speed is considered more important than the maximum classification accuracy.

Journal ArticleDOI
TL;DR: A novel graph matching algorithm is proposed and applied to shape recognition based on object silhouettes, comparing the geodesic paths between skeleton endpoints; it is motivated by the fact that visually similar skeleton graphs may have completely different topological structures.
Abstract: This paper proposes a novel graph matching algorithm and applies it to shape recognition based on object silhouettes. The main idea is to match skeleton graphs by comparing the geodesic paths between skeleton endpoints. In contrast to typical tree or graph matching methods, we do not consider the topological graph structure. Our approach is motivated by the fact that visually similar skeleton graphs may have completely different topological structures. The proposed comparison of geodesic paths between endpoints of skeleton graphs yields correct matching results in such cases. The skeletons are pruned by contour partitioning with discrete curve evolution, which implies that the endpoints of skeleton branches correspond to visual parts of the objects. The experimental results demonstrate that our method is able to produce correct results in the presence of articulations, stretching, and contour deformations.
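A sketch of the path-enumeration step using networkx (graph construction from a binary skeleton and the paper's path-similarity measure, which compares radii of maximal disks sampled along the paths, are omitted; names ours):

```python
import networkx as nx

def endpoint_geodesics(skel):
    """All geodesic paths between endpoints of a skeleton graph.

    skel: networkx Graph whose nodes are skeleton points, with edge weights
    equal to Euclidean length; endpoints are the degree-1 nodes."""
    ends = [n for n in skel if skel.degree(n) == 1]
    return {(s, t): nx.shortest_path(skel, s, t, weight="weight")
            for i, s in enumerate(ends) for t in ends[i + 1:]}
```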

Journal ArticleDOI
TL;DR: This paper introduces a discriminative model for the retrieval of images from text queries that formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance.
Abstract: This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g., our model yields 26.3% average precision on the Corel dataset, compared to 22.0% for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.
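The online learning step can be sketched as a passive-aggressive update on a (query, relevant image, irrelevant image) triplet in a joint feature space; this is the generic linear form of such ranking updates with our own names, not the paper's exact kernel-based procedure:

```python
import numpy as np

def pa_rank_update(w, feat_relevant, feat_irrelevant, C=0.1):
    """One passive-aggressive update on a ranking triplet, where the query
    and each image are already mapped to joint feature vectors."""
    loss = max(0.0, 1.0 - w @ feat_relevant + w @ feat_irrelevant)  # margin loss
    if loss > 0.0:
        diff = feat_relevant - feat_irrelevant
        tau = min(C, loss / (diff @ diff))     # clipped aggressive step size
        w = w + tau * diff                     # push relevant above irrelevant
    return w
```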

Journal ArticleDOI
TL;DR: This work introduces Extremely Randomized Clustering Forests-ensembles of randomly created clustering trees-and shows that they provide more accurate results, much faster training and testing, and good resistance to background clutter.
Abstract: Some of the most effective recent methods for content-based image classification work by quantizing image descriptors, and accumulating histograms of the resulting visual word codes. Large numbers of descriptors and large codebooks are required for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests - ensembles of randomly created clustering trees - and show that they provide more accurate results, much faster training and testing, and good resistance to background clutter. Second, an efficient image classification method is proposed. It combines ERC-Forests and saliency maps very closely with the extraction of image information. For a given image, a classifier builds a saliency map online and uses it to classify the image. We show in several state-of-the-art image classification tasks that this method can speed up the classification process enormously. Finally, we show that the proposed ERC-Forests can also be used very successfully for learning distances between images. The distance computation algorithm consists of learning the characteristic differences between local descriptors sampled from pairs of same or different objects. These differences are vector quantized by ERC-Forests and the similarity measure is computed from this quantization. The similarity measure has been evaluated on four very different datasets and always outperforms the state-of-the-art competitive approaches.
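A hedged sketch of using extremely randomized trees as a vocabulary: each descriptor is dropped down every tree and the leaf it lands in acts as a visual word. scikit-learn's ExtraTreesClassifier stands in for the paper's clustering trees, and all names and parameter values are our own:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def build_erc_codebook(train_descriptors, train_labels, n_trees=5, n_leaves=512):
    """Image-class labels only guide the splits; the trees are read as codebooks."""
    forest = ExtraTreesClassifier(n_estimators=n_trees, max_leaf_nodes=n_leaves)
    return forest.fit(train_descriptors, train_labels)

def encode(forest, descriptors):
    """Concatenated histogram of leaf indices across trees: the image's code."""
    leaves = forest.apply(descriptors)             # (n_descriptors, n_trees)
    sizes = [t.tree_.node_count for t in forest.estimators_]
    hist = np.concatenate([np.bincount(leaves[:, i], minlength=n)
                           for i, n in enumerate(sizes)])
    return hist / max(1, len(descriptors))         # normalized code vector
```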

Journal ArticleDOI
TL;DR: Video synopsis provides a short video representation, while preserving the essential activities of the original video, and can be used to create a synopsis of endless video streams, as generated by webcams and by surveillance cameras.
Abstract: The amount of captured video is growing with the increased numbers of video cameras, especially the increase of millions of surveillance cameras that operate 24 hours a day. Since video browsing and retrieval is time consuming, most captured video is never watched or examined. Video synopsis is an effective tool for browsing and indexing such video. It provides a short video representation, while preserving the essential activities of the original video. The activity in the video is condensed into a shorter period by simultaneously showing multiple activities, even when they originally occurred at different times. The synopsis video is also an index into the original video by pointing to the original time of each activity. Video synopsis can be applied to create a synopsis of endless video streams, as generated by webcams and by surveillance cameras. It can address queries like "show in one minute the synopsis of this camera broadcast during the past day". This process includes two major phases: (i) an online conversion of the endless video stream into a database of objects and activities (rather than frames) and (ii) a response phase, generating the video synopsis as a response to the user's query.