Showing papers in "Computer Vision and Image Understanding in 2012"
TL;DR: An automatic, video-based analysis of the events in Duisburg is presented and methods for the detection and early warning of dangerous situations during mass events are proposed.
Abstract: On July 24, 2010, 21 people died and more than 500 were injured in a stampede at the Loveparade, a music festival, in Duisburg, Germany. Although this tragic incident is but one among many terrible crowd disasters that occur during pilgrimages, sports events, or other mass gatherings, it stands out because it has been well documented: a total of seven security cameras monitored the Loveparade, and the chain of events that led to the disaster was meticulously reconstructed. In this paper, we present an automatic, video-based analysis of the events in Duisburg. While physical models and simulations of human crowd behavior have been reported before, to the best of our knowledge, automatic vision systems that detect congestions and dangerous crowd turbulences in real-world settings have not been reported yet. Drawing on lessons learned from the video footage of the Loveparade, our system is able to detect motion patterns that characterize crowd behavior in stampedes. Based on our analysis, we propose methods for the detection and early warning of dangerous situations during mass events. Since our approach mainly relies on optical flow computations, it runs in real-time and preserves the privacy of the people being monitored.
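As an illustration of the kind of optical-flow statistic such a system can monitor, the sketch below flags a region as potentially turbulent when flow magnitude is high but flow directions are incoherent. This is a minimal sketch; the thresholds and the exact statistics are assumptions, not the paper's implementation.

```python
import math

def flow_region_stats(flow):
    """flow: list of (dx, dy) optical-flow vectors sampled in one image region."""
    mags = [math.hypot(dx, dy) for dx, dy in flow]
    mean_mag = sum(mags) / len(mags)
    # Circular variance of flow directions: near 1 when motion is incoherent
    # (vectors cancel out), near 0 when the crowd moves coherently.
    cx = sum(dx / m for (dx, dy), m in zip(flow, mags) if m > 0)
    cy = sum(dy / m for (dx, dy), m in zip(flow, mags) if m > 0)
    n = sum(1 for m in mags if m > 0)
    dir_var = 1.0 - math.hypot(cx, cy) / n if n else 0.0
    return mean_mag, dir_var

def is_turbulent(flow, mag_thresh=0.5, var_thresh=0.6):
    """Strong but incoherent motion is a crude proxy for crowd turbulence."""
    mean_mag, dir_var = flow_region_stats(flow)
    return mean_mag > mag_thresh and dir_var > var_thresh
```

Because only flow vectors (not images or identities) enter the statistics, such a monitor is consistent with the privacy-preserving property the abstract mentions.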
TL;DR: A new integrated framework that addresses the problems of thermal-visible video registration, sensor fusion, and people tracking for far-range videos is proposed; experiments demonstrate its advantage in obtaining better results for both image registration and tracking than separate image registration and tracking methods.
Abstract: In this work, we propose a new integrated framework that addresses the problems of thermal-visible video registration, sensor fusion, and people tracking for far-range videos. The video registration is based on RANSAC trajectory-to-trajectory matching, which estimates an affine transformation matrix that maximizes the overlap of thermal and visible foreground pixels. Sensor fusion uses the aligned images to compute sum-rule silhouettes, and then constructs thermal-visible object models. Finally, multiple object tracking uses blobs constructed in sensor fusion to output the trajectories. Results demonstrate the advantage of our proposed framework in obtaining better results for both image registration and tracking than separate image registration and tracking methods.
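The RANSAC matching step can be illustrated with a deliberately simplified sketch: the paper estimates a full affine transform from trajectory pairs, while the toy version below estimates only a translation (one correspondence per hypothesis) and keeps the hypothesis with the most inliers.

```python
import random

def ransac_translation(pts_a, pts_b, iters=200, tol=1.0, seed=0):
    """Estimate a 2D translation aligning pts_a to pts_b (paired trajectory
    samples), keeping the hypothesis with the most inliers.  A simplified
    stand-in for the affine RANSAC described in the abstract."""
    rng = random.Random(seed)
    best_t, best_inliers = (0.0, 0.0), -1
    for _ in range(iters):
        # Minimal sample: one correspondence fully determines a translation.
        i = rng.randrange(len(pts_a))
        tx = pts_b[i][0] - pts_a[i][0]
        ty = pts_b[i][1] - pts_a[i][1]
        # Count correspondences explained by this hypothesis.
        inliers = sum(
            1 for (ax, ay), (bx, by) in zip(pts_a, pts_b)
            if abs(ax + tx - bx) <= tol and abs(ay + ty - by) <= tol
        )
        if inliers > best_inliers:
            best_t, best_inliers = (tx, ty), inliers
    return best_t, best_inliers
```

An affine model simply replaces the one-correspondence sample with three and the translation with a 2x3 matrix; the consensus-counting structure is unchanged.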
TL;DR: An approach for anomaly detection and localization, in video surveillance applications, based on spatio-temporal features that capture scene dynamic statistics together with appearance is proposed, and outperforms other state-of-the-art real-time approaches.
Abstract: In this paper we propose an approach for anomaly detection and localization in video surveillance applications, based on spatio-temporal features that capture scene dynamic statistics together with appearance. Real-time anomaly detection is performed with an unsupervised approach using non-parametric modeling, directly evaluating multi-scale local descriptor statistics. A method to update scene statistics is also proposed, to deal with the scene changes that typically occur in real-world settings. The proposed approach has been tested on publicly available datasets to evaluate anomaly detection and localization, and outperforms other state-of-the-art real-time approaches.
TL;DR: An efficient combination of algorithms for the automated localization of the optic disc and macula in retinal fundus images by combining the prediction of multiple algorithms benefiting from their strength and compensating their weaknesses is proposed.
Abstract: This paper proposes an efficient combination of algorithms for the automated localization of the optic disc and macula in retinal fundus images. There is in fact no reason to assume that a single algorithm would be optimal. An ensemble of algorithms based on different principles can be more accurate than any of its individual members if the individual algorithms are doing better than random guessing. We aim to obtain an improved optic disc and macula detector by combining the predictions of multiple algorithms, benefiting from their strengths and compensating for their weaknesses. The location with the maximum number of detectors' outputs is termed the hotspot and is used to find the optic disc or macula center. An assessment of the performance of the integrated system and of the detectors working separately is also presented. Our proposed combination of detectors achieved the overall highest performance in detecting the optic disc and fovea closest to the centers manually chosen by the retinal specialist.
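The hotspot rule can be sketched as a simple vote count: each detector proposes a center, and the candidate supported by the most detectors within a radius wins. The radius and tie-breaking below are illustrative assumptions, not the paper's exact procedure.

```python
def hotspot(detections, radius=10.0):
    """detections: one (x, y) candidate per detector.  Return the candidate
    supported by the largest number of detectors within `radius` of it --
    a simple reading of the hotspot rule.  Ties go to the first candidate."""
    def votes(p):
        return sum(1 for q in detections
                   if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2)
    return max(detections, key=votes)
```

A single badly failing detector (an outlier far from the consensus) collects few votes and therefore cannot pull the hotspot away, which is exactly the robustness argument the abstract makes for the ensemble.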
TL;DR: This paper presents a novel approach for robust and selective STIP detection, by applying surround suppression combined with local and temporal constraints, and introduces a novel vocabulary building strategy by combining spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency.
Abstract: Recent progress in the field of human action recognition points towards the use of Spatio-Temporal Interest Points (STIPs) for local descriptor-based recognition strategies. In this paper, we present a novel approach for robust and selective STIP detection, by applying surround suppression combined with local and temporal constraints. This new method differs significantly from existing STIP detection techniques and improves performance by detecting more repeatable, stable and distinctive STIPs for human actors, while suppressing unwanted background STIPs. For action representation we use a bag-of-video-words (BoV) model of local N-jet features to build a vocabulary of visual words. To this end, we introduce a novel vocabulary building strategy that combines spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency. Action class specific Support Vector Machine (SVM) classifiers are trained for the categorization of human actions. A comprehensive set of experiments on popular benchmark datasets (KTH and Weizmann), more challenging datasets of complex scenes with background clutter and camera motion (CVC and CMU), movie and YouTube video clips (Hollywood 2 and YouTube), and complex scenes with multiple actors (MSR I and Multi-KTH) validates our approach and shows state-of-the-art performance. Due to the unavailability of ground truth action annotation data for the Multi-KTH dataset, we introduce an actor-specific spatio-temporal clustering of STIPs to address the problem of automatic action annotation of multiple simultaneous actors. Additionally, we perform cross-data action recognition by training on source datasets (KTH and Weizmann) and testing on completely different and more challenging target datasets (CVC, CMU, MSR I and Multi-KTH). This documents the robustness of our proposed approach in a realistic scenario with separate training and test datasets, which in general has been a shortcoming in the performance evaluation of human action recognition techniques.
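The bag-of-video-words representation step reduces, in essence, to nearest-codeword quantization followed by histogramming; a minimal sketch (not the paper's N-jet pipeline, and with plain Euclidean quantization assumed):

```python
def bov_histogram(descriptors, vocabulary):
    """Quantize each local descriptor to its nearest visual word (squared
    Euclidean distance) and return a normalized word histogram -- the BoV
    representation that is then fed to a classifier such as an SVM."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(
            range(len(vocabulary)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(d, vocabulary[k])),
        )
        hist[nearest] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist
```

A spatial pyramid, as used above, simply computes one such histogram per spatio-temporal cell and concatenates them.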
TL;DR: Experiments showed that the proposed statistical approach to visual texture description outperforms existing static texture classification methods and is comparable to the top dynamic texture classification techniques.
Abstract: Visual texture is a powerful cue for the semantic description of scene structures that exhibit a high degree of similarity in their image intensity patterns. This paper describes a statistical approach to visual texture description that combines a highly discriminative local feature descriptor with a powerful global statistical descriptor. Based upon a SIFT-like feature descriptor densely estimated at multiple window sizes, a statistical descriptor, called the multi-fractal spectrum (MFS), extracts the power-law behavior of the local feature distributions over scale. Through this combination strong robustness to environmental changes including both geometric and photometric transformations is achieved. Furthermore, to increase the robustness to changes in scale, a multi-scale representation of the multi-fractal spectra under a wavelet tight frame system is derived. The proposed statistical approach is applicable to both static and dynamic textures. Experiments showed that the proposed approach outperforms existing static texture classification methods and is comparable to the top dynamic texture classification techniques.
TL;DR: A new vision based framework for driver foot behavior analysis is proposed using optical flow based foot tracking and a Hidden Markov Model (HMM) based technique to characterize the temporal foot behavior.
Abstract: Understanding driver behavior is an essential component of human-centric Intelligent Driver Assistance Systems. Specifically, driver foot behavior is an important factor in controlling the vehicle, though there have been very few research studies on analyzing foot behavior. While embedded pedal sensors may reveal some information about driver foot behavior, vision-based foot behavior analysis has additional advantages. The foot movement before and after a pedal press can provide valuable information for a better semantic understanding of driver behaviors, states, and styles. It can also be used to gain a time advantage by predicting a pedal press before it actually happens, which is very important for providing proper assistance to the driver in time-critical (e.g. safety-related) situations. In this paper, we propose and develop a new vision-based framework for driver foot behavior analysis using optical flow based foot tracking and a Hidden Markov Model (HMM) based technique to characterize the temporal foot behavior. In our experiment with a real-world driving testbed, we also use our trained HMM foot behavior model to predict brake and acceleration pedal presses. The experimental results over different subjects showed high accuracy (~94% on average) for both foot behavior state inference and pedal press prediction. At 133 ms before the actual press, ~74% of the pedal presses were predicted correctly. This shows the promise of applying this approach to real-world driver assistance systems.
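The HMM-based temporal inference can be sketched with the standard forward (filtering) recursion. The states, observation symbols, and probabilities below are illustrative stand-ins, not the paper's trained model.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward-algorithm filtering: P(state_t | o_1..t) for a discrete HMM.
    obs is a sequence of observation symbols; the dicts hold the start,
    transition, and emission probabilities."""
    belief = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    z = sum(belief.values())
    belief = {s: v / z for s, v in belief.items()}
    for o in obs[1:]:
        # Predict through the transition model, then weight by the emission.
        belief = {s: emit_p[s][o] * sum(belief[r] * trans_p[r][s] for r in states)
                  for s in states}
        z = sum(belief.values())
        belief = {s: v / z for s, v in belief.items()}
    return belief
```

Pedal-press prediction then amounts to thresholding the filtered probability of a "moving toward pedal" state before the press itself is observed.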
TL;DR: This paper reviews the existing methods designed to calibrate any central omnivision system and analyzes their advantages and drawbacks through an in-depth comparison using simulated and real data.
Abstract: Omnidirectional cameras are becoming increasingly popular in computer vision and robotics. Camera calibration is a necessary step before performing any task involving metric scene measurement, as required in nearly all robotics tasks. In recent years many different methods to calibrate central omnidirectional cameras have been developed, based on different camera models and often limited to a specific mirror shape. In this paper we review the existing methods designed to calibrate any central omnivision system and analyze their advantages and drawbacks through an in-depth comparison using simulated and real data. We choose methods that are available as open source and that do not require a complex pattern or scene. The evaluation protocol for calibration accuracy also considers 3D metric reconstruction combining omnidirectional images. Comparative results are shown and discussed in detail.
TL;DR: In this article, a discriminant multiple coupled latent subspace framework is proposed to find the sets of projection directions for different poses such that the projected images of the same subject in different poses are maximally correlated in the latent space.
Abstract: We propose a novel pose-invariant face recognition approach which we call the Discriminant Multiple Coupled Latent Subspace framework. It finds sets of projection directions for different poses such that the projected images of the same subject in different poses are maximally correlated in the latent space. Discriminant analysis with artificially simulated pose errors in the latent space makes it robust to small pose errors caused by incorrect pose estimation for a subject. We perform a comparative analysis of three popular latent space learning approaches: Partial Least Squares (PLS), Bilinear Model (BLM) and Canonical Correlation Analysis (CCA) in the proposed coupled latent subspace framework. We experimentally demonstrate that using more than two poses simultaneously with CCA results in better performance. We report state-of-the-art results for pose-invariant face recognition on CMU PIE and FERET and comparable results on MultiPIE when using only four fiducial points for alignment and intensity features.
TL;DR: The discriminant movement representation combined with camera viewpoint identification and a nearest centroid classification step leads to a high human movement classification accuracy.
Abstract: In this paper, a novel multi-view human movement recognition method is presented. A novel representation of multi-view human movement videos is proposed that is based on learning basic multi-view human movement primitives, called multi-view dynemes. The movement video is represented in a new feature space (called dyneme space) using these multi-view dynemes, thus producing a time invariant multi-view movement representation. Fuzzy distances from the multi-view dynemes are used to represent the human body postures in the dyneme space. Three variants of Linear Discriminant Analysis (LDA) are evaluated to achieve a discriminant movement representation in a low dimensionality space. The view identification problem is solved either by using a circular block shift procedure followed by the evaluation of the minimum Euclidean distance from any dyneme, or by exploiting the circular shift invariance property of the Discrete Fourier Transform (DFT). The discriminant movement representation combined with camera viewpoint identification and a nearest centroid classification step leads to a high human movement classification accuracy.
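The DFT-based solution to the view identification problem rests on the fact that a circular shift of a sequence changes only the phase of its DFT, not its magnitudes. A small sketch verifying this shift-invariance property:

```python
import cmath

def dft_magnitudes(x):
    """Magnitudes of the Discrete Fourier Transform of a real sequence,
    computed directly from the definition (fine for short sequences)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A circular shift of the input (an unknown camera viewpoint re-orders the
# circularly arranged multi-view features) leaves |DFT| unchanged, so taking
# magnitudes yields a viewpoint-invariant representation.
```

This is why the DFT route avoids the explicit circular block shift search mentioned in the abstract: invariance is obtained in closed form instead of by enumeration.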
TL;DR: By exploiting contextual information, the proposed system is able to make more accurate detections, especially of those behaviours which are only suspicious in some contexts while being normal in the others, and gives critical feedback to the system designers to refine the system.
Abstract: Video surveillance using Closed Circuit Television (CCTV) cameras is one of the fastest growing areas in the field of security technologies. However, existing video surveillance systems are still not at a stage where they can be used for crime prevention. The systems rely heavily on human observers and are therefore limited by factors such as fatigue and monitoring capabilities over long periods of time. This work attempts to address these problems by proposing an automatic suspicious behaviour detection system which utilises contextual information. The utilisation of contextual information is done via three main components: a context space model, a data stream clustering algorithm, and an inference algorithm. The utilisation of contextual information is still limited in the domain of suspicious behaviour detection. Furthermore, it is nearly impossible to correctly understand human behaviour without considering the context where it is observed. This work presents experiments using video feeds taken from the CAVIAR dataset and from a camera mounted on one of the buildings (Z-Block) at the Queensland University of Technology, Australia. From these experiments, it is shown that by exploiting contextual information, the proposed system is able to make more accurate detections, especially of those behaviours which are only suspicious in some contexts while being normal in others. Moreover, this information gives critical feedback to the system designers to refine the system.
TL;DR: Five core optimization constraints used by 13 ego-motion estimation methods, together with different optimization techniques, are identified, and some of the 13 methods are combined with robust estimation techniques such as M-functions or RANSAC to improve estimates for noisy visual motion fields.
Abstract: If a visual observer moves through an environment, the patterns of light that impinge on its retina vary, leading to changes in sensed brightness. Spatial shifts of brightness patterns in the 2D image over time are called optic flow. In contrast to optic flow, visual motion fields denote the displacement of 3D scene points projected onto the camera's sensor surface. For translational and rotational movement through a rigid scene, parametric models of visual motion fields have been defined. Besides ego-motion, these models provide access to relative depth, and both ego-motion and depth information are useful for visual navigation. In the past 30 years, methods for ego-motion estimation based on models of visual motion fields have been developed. In this review we identify five core optimization constraints which are used by 13 methods together with different optimization techniques. In the literature, methods for ego-motion estimation have typically been evaluated using an error measure which tests only a specific ego-motion. Furthermore, most simulation studies used only a Gaussian noise model. In contrast, we test multiple types and instances of ego-motion. One type is a fixating ego-motion, another is a curvilinear ego-motion. Based on simulations we study properties such as statistical bias, consistency, variability of depths, and the robustness of the methods with respect to a Gaussian or outlier noise model. In order to improve estimates for noisy visual motion fields, some of the 13 methods are combined with techniques for robust estimation such as M-functions or RANSAC. Furthermore, a realistic scenario of a stereo image sequence has been generated and used to evaluate methods of ego-motion estimation provided with estimated optic flow and depth information.
TL;DR: A novel online framework for behavior understanding, in visual workflows, capable of achieving high recognition rates in real-time, using a Bayesian filter supported by hidden Markov models and a novel re-adjustment framework of behavior recognition and classification.
Abstract: In this paper, we propose a novel online framework for behavior understanding, in visual workflows, capable of achieving high recognition rates in real-time. To effect online recognition, we propose a methodology that employs a Bayesian filter supported by hidden Markov models. We also introduce a novel re-adjustment framework of behavior recognition and classification by incorporating the user's feedback into the learning process through two proposed schemes: a plain non-linear one and a more sophisticated recursive one. The proposed approach aims at dynamically correcting erroneous classification results to enhance the behavior modeling and therefore the overall classification rates. The performance is thoroughly evaluated under real-life complex visual behavior understanding scenarios in an industrial plant. The obtained results are compared and discussed.
TL;DR: This paper presents a principled approach to learning a semantic vocabulary from a large amount of video words using Diffusion Maps embedding, and conjecture that the mid-level features produced by similar video sources must lie on a certain manifold.
Abstract: Efficient modeling of actions is critical for recognizing human actions. Recently, bag of video words (BoVW) representation, in which features computed around spatiotemporal interest points are quantized into video words based on their appearance similarity, has been widely and successfully explored. The performance of this representation however, is highly sensitive to two main factors: the granularity, and therefore, the size of vocabulary, and the space in which features and words are clustered, i.e., the distance measure between data points at different levels of the hierarchy. The goal of this paper is to propose a representation and learning framework that addresses both these limitations. We present a principled approach to learning a semantic vocabulary from a large amount of video words using Diffusion Maps embedding. As opposed to flat vocabularies used in traditional methods, we propose to exploit the hierarchical nature of feature vocabularies representative of human actions. Spatiotemporal features computed around interest points in videos form the lowest level of representation. Video words are then obtained by clustering those spatiotemporal features. Each video word is then represented by a vector of Pointwise Mutual Information (PMI) between that video word and training video clips, and is treated as a mid-level feature. At the highest level of the hierarchy, our goal is to further cluster the mid-level features, while exploiting semantically meaningful distance measures between them. We conjecture that the mid-level features produced by similar video sources (action classes) must lie on a certain manifold. To capture the relationship between these features, and retain it during clustering, we propose to use diffusion distance as a measure of similarity between them. The underlying idea is to embed the mid-level features into a lower-dimensional space, so as to construct a compact yet discriminative, high level vocabulary. 
Unlike some supervised vocabulary construction approaches and unsupervised methods such as pLSA and LDA, Diffusion Maps can capture the local relationship between the mid-level features on the manifold. We have tested our approach on diverse datasets and have obtained very promising results.
TL;DR: This work presents a graph matching method to solve the point-set correspondence problem, which is posed as one of mixture modelling, and uses a true continuous underlying correspondence variable.
Abstract: Finding correspondences between two point-sets is a common step in many vision applications (e.g., image matching or shape retrieval). We present a graph matching method to solve the point-set correspondence problem, which is posed as one of mixture modelling. Our mixture model encompasses a model of structural coherence and a model of affine-invariant geometrical errors. Instead of absolute positions, the geometrical positions are represented as relative positions of the points with respect to each other. We derive the Expectation-Maximization algorithm for our mixture model. In this way, the graph matching problem is approximated, in a principled way, as a succession of assignment problems which are solved using Softassign. Unlike other approaches, we use a true continuous underlying correspondence variable. We develop effective mechanisms to detect outliers. This is a useful technique for improving results in the presence of clutter. We evaluate the ability of our method to locate proper matches as well as to recognize object categories in a series of registration and recognition experiments. Our method compares favourably to other graph matching methods as well as to point-set registration methods and outlier rejectors.
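The Softassign step at the heart of the succession of assignment problems relies on Sinkhorn normalization: alternately normalizing the rows and columns of a positive score matrix drives it toward a doubly stochastic matrix, i.e. a continuous (soft) correspondence. A minimal sketch:

```python
def sinkhorn(m, iters=50):
    """Alternating row/column normalization of a positive matrix.  After
    convergence, rows and columns each sum to 1, so entry (i, j) can be read
    as a soft assignment of point i to point j -- the core of Softassign."""
    m = [row[:] for row in m]  # work on a copy
    for _ in range(iters):
        for row in m:                      # normalize rows
            s = sum(row)
            for j in range(len(row)):
                row[j] /= s
        for j in range(len(m[0])):         # normalize columns
            s = sum(row[j] for row in m)
            for row in m:
                row[j] /= s
    return m
```

In the full algorithm, an extra slack row and column absorb outliers, and an annealing schedule sharpens the soft assignment toward a discrete matching; both refinements are omitted here.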
TL;DR: 3D articulated tracking avoids the need for view-based models, specific camera viewpoints, and constrained domains, and the attribute inference task provides a natural benchmark for evaluating the performance of 3D pose tracking methods (vs. conventional Euclidean joint error metrics).
Abstract: It is well known that biological motion conveys a wealth of socially meaningful information. From even a brief exposure, biological motion cues enable the recognition of familiar people, and the inference of attributes such as gender, age, mental state, actions and intentions. In this paper we show that from the output of a video-based 3D human tracking algorithm we can infer physical attributes (e.g., gender and weight) and aspects of mental state (e.g., happiness or sadness). In particular, with 3D articulated tracking we avoid the need for view-based models, specific camera viewpoints, and constrained domains. The task is useful for man-machine communication, and it provides a natural benchmark for evaluating the performance of 3D pose tracking methods (vs. conventional Euclidean joint error metrics). We show results on a large corpus of motion capture data and on the output of a simple 3D pose tracker applied to videos of people walking.
TL;DR: This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including real-time operation, poor imaging conditions, and a decentralized architecture.
Abstract: This paper presents an integrated solution for the problem of detecting, tracking and identifying vehicles in a tunnel surveillance application, taking into account practical constraints including real-time operation, poor imaging conditions, and a decentralized architecture. Vehicles are followed through the tunnel by a network of non-overlapping cameras. They are detected and tracked in each camera and then identified, i.e. matched to any of the vehicles detected in the previous camera(s). To limit the computational load, we propose to reuse the same set of Haar-features for each of these steps. For the detection, we use an AdaBoost cascade. Here we introduce a composite confidence score, integrating information from all stages of the cascade. A subset of the features used for detection is then selected, optimizing for the identification problem. This results in a compact binary 'vehicle fingerprint', requiring minimal bandwidth. Finally, we show that the same subset of features can also be used effectively for tracking. This Haar-features based 'tracking-by-identification' yields surprisingly good results on standard datasets, without the need to update the model online. The general multi-camera framework is validated using three tunnel surveillance videos.
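Matching a binary 'vehicle fingerprint' against the vehicles seen at the previous camera can be sketched as nearest-neighbor search under Hamming distance (illustrative; the actual bits come from thresholded, selected Haar-feature responses):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary fingerprints."""
    return sum(x != y for x, y in zip(a, b))

def identify(fingerprint, gallery):
    """Return the index of the gallery fingerprint (vehicles from the
    previous camera) closest to the query under Hamming distance."""
    return min(range(len(gallery)), key=lambda i: hamming(fingerprint, gallery[i]))
```

Since a fingerprint is just a short bit vector, transmitting it between the decentralized camera nodes needs only a few bytes, which is the minimal-bandwidth point made in the abstract.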
TL;DR: This paper attacks the key problem of camera pose estimation, in an automatic and efficient way, by matching vanishing points with 3D directions derived from a 3D range model, and utilizing low-level linear features.
Abstract: The photorealistic modeling of large-scale objects, such as urban scenes, requires the combination of range sensing technology and digital photography. In this paper, we attack the key problem of camera pose estimation, in an automatic and efficient way. First, the camera orientation is recovered by matching vanishing points (extracted from 2D images) with 3D directions (derived from a 3D range model). Then, a hypothesis-and-test algorithm computes the camera positions with respect to the 3D range model by matching corresponding 2D and 3D linear features. The camera positions are further optimized by minimizing a line-to-line distance. The advantage of our method over earlier work has to do with the fact that we do not need to rely on extracted planar facades, or other higher-order features; we are utilizing low-level linear features. That makes this method more general, robust, and efficient. We have also developed a user-interface for allowing users to accurately texture-map 2D images onto 3D range models at interactive rates. We have tested our system in a large variety of urban scenes.
TL;DR: The proposed approach was applied to the segmentation of internal brain structures in magnetic resonance images; the results show the relevance of the optimization criteria and the value of the backtracking procedure in guaranteeing good and consistent results.
Abstract: A sequential segmentation framework, where objects in an image are successively segmented, generally raises questions about the "best" segmentation sequence to follow and/or how to avoid error propagation. In this work, we propose original approaches to answer these questions in the case where the objects to segment are represented by a model describing the spatial relations between them. The process is guided by a criterion derived from visual attention, more precisely from a saliency map, along with some spatial information to focus the attention. This criterion is used to optimize the segmentation sequence. Spatial knowledge is also used to ensure the consistency of the results and to allow backtracking on the segmentation order if needed. The proposed approach was applied to the segmentation of internal brain structures in magnetic resonance images. The results show the relevance of the optimization criteria and the value of the backtracking procedure in guaranteeing good and consistent results.
TL;DR: A novel moving object detection algorithm is proposed, for which an illumination change model, a chromaticity difference model, and a brightness ratio model are developed; the latter two estimate the intensity difference and intensity ratio of false foreground pixels, respectively.
Abstract: To solve the problem of fast illumination change in a visual surveillance system, we propose a novel moving object detection algorithm for which we develop an illumination change model, a chromaticity difference model, and a brightness ratio model. When a fast illumination change occurs, background pixels as well as moving object pixels are detected as foreground pixels. To separate the detected foreground pixels into moving object pixels and false foreground pixels, we develop a chromaticity difference model and a brightness ratio model that estimate the intensity difference and intensity ratio of false foreground pixels, respectively. These models are based on the proposed illumination change model. Experimental results show that the proposed method performs excellently under various illumination change conditions while operating in real-time.
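The separation of false foreground pixels from true object pixels can be sketched as follows: under an illumination change, chromaticity stays nearly constant while overall brightness scales, whereas a real object typically changes chromaticity as well. The thresholds below are illustrative assumptions, not the bounds the paper derives from its illumination change model.

```python
def classify_pixel(bg, cur, ratio_band=(0.3, 3.0), chroma_tol=0.05):
    """Decide whether a detected foreground pixel is a real moving-object
    pixel or a false detection caused by an illumination change.
    bg, cur: (R, G, B) of the background model and the current frame."""
    rb, gb, bb = bg
    rc, gc, bc = cur
    sb, sc = rb + gb + bb, rc + gc + bc
    # Chromaticity (color normalized by total intensity) is nearly invariant
    # to a global illumination change; the brightness ratio captures the
    # change in total intensity.
    chroma_diff = max(abs(rb / sb - rc / sc), abs(gb / sb - gc / sc))
    ratio = sc / sb
    if chroma_diff < chroma_tol and ratio_band[0] < ratio < ratio_band[1]:
        return "illumination_change"
    return "moving_object"
```

Only pixels already flagged as foreground would be passed through this second test, which is what lets the overall pipeline stay real-time.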
TL;DR: The proposed tracker enhances the recently suggested FragTrack algorithm with an adaptive cue integration scheme: the original tracker is embedded into a particle filter framework, a reliability value is associated with each fragment describing a different part of the target object, and these reliabilities are dynamically adjusted at each frame with respect to the current context.
Abstract: In this paper, we address the issue of part-based tracking by proposing a new fragments-based tracker. The proposed tracker enhances the recently suggested FragTrack algorithm to employ an adaptive cue integration scheme. This is done by embedding the original tracker into a particle filter framework, associating a reliability value to each fragment that describes a different part of the target object and dynamically adjusting these reliabilities at each frame with respect to the current context. Particularly, the vote of each fragment contributes to the joint tracking result according to its reliability, and this allows us to achieve a better accuracy in handling partial occlusions and pose changes while preserving and even improving the efficiency of the original tracker. In order to demonstrate the performance and the effectiveness of the proposed algorithm we present qualitative and quantitative results on a number of challenging video sequences.
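The reliability-weighted fragment voting can be sketched as a weighted average of per-fragment position votes, with fragments that disagree with the fused result (e.g. occluded parts) down-weighted over time. Both the weighting and the update rule below are illustrative assumptions, not the paper's exact scheme.

```python
import math

def fused_estimate(fragment_votes, reliabilities):
    """Combine per-fragment (x, y) position votes, weighting each fragment's
    vote by its current reliability -- adaptive cue integration in one line
    per coordinate."""
    total = sum(reliabilities)
    x = sum(w * vx for w, (vx, vy) in zip(reliabilities, fragment_votes)) / total
    y = sum(w * vy for w, (vx, vy) in zip(reliabilities, fragment_votes)) / total
    return x, y

def update_reliabilities(reliabilities, errors, lam=0.5):
    """Exponentially down-weight fragments whose votes had large error with
    respect to the fused estimate, then renormalize so weights stay a
    distribution.  Weights remain strictly positive, so a temporarily
    occluded fragment can recover."""
    new = [w * math.exp(-lam * e) for w, e in zip(reliabilities, errors)]
    s = sum(new)
    return [w / s for w in new]
```

A fragment covered by a partial occlusion votes far from the consensus, loses weight, and stops corrupting the joint estimate until its votes become consistent again.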
TL;DR: An accurate and fast approach for MR-image segmentation of brain tissues is presented that is robust to anatomical variations and takes an average of less than 1 min to complete on modern PCs.
Abstract: We present an accurate and fast approach for MR-image segmentation of brain tissues that is robust to anatomical variations and takes an average of less than 1 min to complete on modern PCs. The method first corrects voxel values in the brain based on local estimations of the white-matter intensities. This strategy is inspired by other works, but it is simple, fast, and very effective. Tissue classification exploits a recent clustering approach based on optimum-path forest (OPF), which can find natural groups such that the absolute majority of voxels in each group belongs to the same class. First, a small random set of brain voxels is used for OPF clustering. Cluster labels are propagated to the remaining voxels, and then class labels are assigned to each group. The experiments used several datasets from three protocols (involving normal subjects, phantoms, and patients), two state-of-the-art approaches, and a novel methodology that finds the best choice of parameters for each method within the operational range of those parameters using a training dataset. The proposed method outperformed the compared approaches in speed, accuracy, and robustness.
TL;DR: This article defines the so-called bio-inspired features associated to an input video, based on the average activity of MT cells, and shows how these features can be used in a standard classification method to perform action recognition.
Abstract: Motion is a key feature for a wide class of computer vision approaches to recognize actions. In this article, we show how to define bio-inspired features for action recognition. To do so, we start from a well-established bio-inspired motion model of cortical areas V1 and MT. The primary visual cortex, designated as V1, is the first cortical area encountered in visual stream processing, and the early responses of V1 cells consist of tiled sets of selective spatiotemporal filters. The second cortical area of interest in this article is area MT, where MT cells pool incoming information from V1 according to the shape and characteristics of their receptive fields. To go beyond the classical models, and following the observations of Xiao et al., we propose here to model different surround geometries for MT cell receptive fields. Then, we define the so-called bio-inspired features associated with an input video, based on the average activity of MT cells. Finally, we show how these features can be used in a standard classification method to perform action recognition. Results are given for the Weizmann and KTH databases. Interestingly, we show that the diversity of motion representation at the MT level (different surround geometries) is a major advantage for action recognition. On the Weizmann database, the inclusion of different MT surround geometries improved the recognition rate from 63.01 ± 2.07% up to 99.26 ± 1.66% in the best case. Similarly, on the KTH database, the recognition rate was significantly improved by the inclusion of different MT surround geometries (from 47.82 ± 2.71% up to 92.44 ± 0.01% in the best case). We also discuss the limitations of the current approach, which are closely related to the input video duration. These promising results encourage us to further develop bio-inspired models incorporating other brain mechanisms and cortical areas in order to deal with more complex videos.
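The feature-extraction idea above (time-average the MT-level activity, then feed the result to a standard classifier) can be sketched as follows. The pooling and the nearest-centroid rule are stand-ins for the paper's MT model and classifier, chosen only for illustration.

```python
# Illustrative sketch: turn per-cell activity over a video into a
# "bio-inspired" feature vector by time-averaging, then classify it
# with a simple nearest-centroid rule.
import numpy as np

def average_activity(responses):
    """responses: (T, C) array of C cell activities over T frames.
    Returns the (C,) time-averaged feature vector."""
    return np.asarray(responses, dtype=float).mean(axis=0)

def nearest_centroid(feature, centroids):
    """centroids: dict label -> (C,) prototype; return the closest label."""
    return min(centroids, key=lambda k: np.linalg.norm(feature - centroids[k]))
```

Because the feature is a time average, very short clips yield noisy estimates, consistent with the duration limitation the abstract mentions.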
TL;DR: This work uses multiple relatively-shifted LR range images, where the motion between the LR images serves as a cue for super-resolution, and exploits a cue from segmentation of an optical image of the same scene, which constrains pixels in the same color segment to have similar range values.
Abstract: Range images often suffer from issues such as low resolution (LR) in low-cost scanners and missing regions caused by poor reflectivity and occlusions. Another common problem (with high-quality scanners) is long acquisition times. In this work, we propose two approaches to counter these shortcomings. Our first proposal, which addresses the issues of low resolution as well as missing regions, is an integrated super-resolution (SR) and inpainting approach. We use multiple relatively shifted LR range images, where the motion between the LR images serves as a cue for super-resolution. Our imaging model also accounts for missing regions to enable inpainting. Our framework models the high-resolution (HR) range as a Markov random field (MRF), and uses inhomogeneous MRF priors to constrain the solution differently for inpainting and super-resolution. Our super-resolved and inpainted outputs show significant improvements over their LR/interpolated counterparts. Our second proposal addresses the issue of long acquisition times by facilitating reconstruction of range data from very sparse measurements. Our technique exploits a cue from segmentation of an optical image of the same scene, which constrains pixels in the same color segment to have similar range values. Our approach is able to reconstruct range images from as little as 10% of the data. We also study the performance of both proposed approaches in a noisy scenario as well as in the presence of alignment errors.
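The segmentation cue for the sparse-measurement case can be sketched with a deliberately crude rule: spread each colour segment's measured range values across the whole segment. This is a hedged stand-in for the paper's optimisation, not its actual reconstruction method.

```python
# Sketch of the segmentation cue: pixels in the same colour segment are
# constrained to have similar range, so missing pixels take the mean of
# the measured range values inside their segment.
import numpy as np

def fill_from_segments(range_sparse, mask, segments):
    """range_sparse: (H, W) range map, valid where mask is True.
    segments: (H, W) integer segment ids from an optical image.
    Returns a dense range map."""
    out = np.array(range_sparse, dtype=float)
    for s in np.unique(segments):
        seg = segments == s
        known = seg & mask
        if known.any():
            # propagate the segment's mean measured range to its holes
            out[seg & ~mask] = out[known].mean()
    return out
```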
TL;DR: MMTrack (max-margin tracker), a single-target tracker that linearly combines constant and adaptive appearance features, is introduced and a system combining a variety of appearance features and a motion model is demonstrated, with the parameters of these features learned jointly in a coherent learning framework.
Abstract: We introduce MMTrack (max-margin tracker), a single-target tracker that linearly combines constant and adaptive appearance features. We frame offline single-camera tracking as a structured output prediction task where the goal is to find a sequence of locations of the target given a video. Following recent advances in machine learning, we discriminatively learn tracker parameters by first generating suitable bad trajectories and then employing a margin criterion to learn how to distinguish among ground truth trajectories and all other possibilities. Our framework for tracking is general, and can be used with a variety of features. We demonstrate a system combining a variety of appearance features and a motion model, with the parameters of these features learned jointly in a coherent learning framework. Further, taking advantage of a reliable human detector, we present a natural way of extending our tracker to a robust detection and tracking system. We apply our framework to pedestrian tracking and experimentally demonstrate the effectiveness of our method on two real-world data sets, achieving results comparable to state-of-the-art tracking systems.
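The margin criterion described above can be sketched as a structured hinge: the score of the ground-truth trajectory must beat every generated bad trajectory by a margin equal to that trajectory's task loss. The linear scoring and the max-over-constraints form are standard structured-output assumptions, not the paper's exact formulation.

```python
# Sketch of a structured hinge for trajectory learning: zero when the
# ground-truth trajectory wins by the required margin, positive otherwise.
import numpy as np

def structured_hinge(w, phi_gt, phi_bad, losses):
    """w: weight vector; phi_gt: features of the true trajectory;
    phi_bad: feature vectors of generated bad trajectories;
    losses: task loss of each bad trajectory."""
    s_gt = float(np.dot(w, phi_gt))
    # each bad trajectory demands: s_gt >= score(bad) + loss(bad)
    viol = [loss + float(np.dot(w, f)) - s_gt
            for f, loss in zip(phi_bad, losses)]
    return max(0.0, max(viol))
```

Minimising this quantity over `w` (plus regularisation) is what discriminatively tunes the combination of appearance and motion features.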
TL;DR: The minimal levels of linear correlation between the outputs produced by the proposed strategy and other state-of-the-art techniques suggest that the fusion of both recognition techniques significantly improves performance, which is regarded as a positive step towards the development of extremely ambitious types of biometric recognition.
Abstract: Despite the substantial research into the development of covert iris recognition technologies, no machine to date has been able to reliably perform recognition of human beings in real-world data. This limitation is especially evident in the application of such technology to large-scale identification scenarios, which demand extremely low error rates to avoid frequent false alarms. Most previously published works have used intensity data and performed multi-scale analysis to achieve recognition, obtaining encouraging performance values that are nevertheless far from desirable. This paper presents two key innovations. (1) A recognition scheme is proposed based on techniques that are substantially different from those traditionally used, starting with the dynamic partition of the noise-free iris into disjoint regions from which MPEG-7 color and shape descriptors are extracted. (2) The minimal levels of linear correlation between the outputs produced by the proposed strategy and other state-of-the-art techniques suggest that the fusion of both recognition techniques significantly improves performance, which is regarded as a positive step towards the development of extremely ambitious types of biometric recognition.
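The rationale above rests on a standard observation: weakly correlated matchers fuse well. A simple sum-rule fusion of min-max normalised scores illustrates the idea; the normalisation and equal weighting are assumptions, not the paper's exact scheme.

```python
# Sketch: sum-rule fusion of two matchers' scores. When the matchers'
# errors are weakly correlated, the fused score can outperform either alone.
import numpy as np

def minmax(scores):
    """Rescale scores to [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())

def fuse(scores_a, scores_b):
    """Equal-weight sum-rule fusion of two normalised score lists."""
    return 0.5 * (minmax(scores_a) + minmax(scores_b))
```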
TL;DR: Experimental validation on data from two different datasets illustrates the significant biometric authentication potential of the proposed framework in realistic scenarios, in which the user is unobtrusively observed, while the use of the static anthropometric profile is seen to significantly improve performance with respect to state-of-the-art approaches.
Abstract: This paper presents a novel framework for unobtrusive biometric authentication based on the spatiotemporal analysis of human activities. Initially, the subject's actions, recorded by a stereoscopic camera, are detected using motion history images. Then, two novel unobtrusive biometric traits are proposed: the static anthropometric profile, which accurately encodes the inter-subject variability in human body dimensions, and the activity-related trait, based on dynamic motion trajectories, which encodes the behavioral inter-subject variability in performing a specific action. Subsequently, score-level fusion is performed via support vector machines. Finally, an ergonomics-based quality indicator is introduced to evaluate the authentication potential of a specific trial. Experimental validation on data from two different datasets illustrates the significant biometric authentication potential of the proposed framework in realistic scenarios, in which the user is unobtrusively observed, while the use of the static anthropometric profile significantly improves performance with respect to state-of-the-art approaches.
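The motion history image (MHI) used for action detection above is a classic representation: recent motion appears bright and older motion fades linearly. A minimal per-frame update can be sketched as follows, with the duration `tau` an assumed parameter.

```python
# Sketch of a motion history image update: decay old motion by one step,
# then stamp currently moving pixels at full value tau.
import numpy as np

def update_mhi(mhi, motion_mask, tau=10.0):
    """mhi: (H, W) float history; motion_mask: boolean (H, W) of
    pixels moving in the current frame. Returns the updated MHI."""
    mhi = np.maximum(mhi - 1.0, 0.0)    # fade older motion
    mhi[np.asarray(motion_mask)] = tau  # stamp current motion
    return mhi
```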
TL;DR: In this paper, a set of composed complex-cue image descriptors is introduced and evaluated with respect to the problems of recognizing previously seen object instances from previously unseen views, and classifying previously unseen objects into visual categories.
Abstract: Recent work has shown that effective methods for recognizing objects and spatio-temporal events can be constructed based on histograms of receptive-field-like image operations. This paper presents the results of an extensive study of the performance of different types of receptive-field-like image descriptors for histogram-based object recognition, based on different combinations of image cues in terms of Gaussian derivatives or differential invariants applied to intensity information, color-opponent channels, or both. A rich set of composed complex-cue image descriptors is introduced and evaluated with respect to the problems of (i) recognizing previously seen object instances from previously unseen views, and (ii) classifying previously unseen objects into visual categories. It is shown that there exist novel histogram descriptors with significantly better recognition performance than previously used histogram features within the same class. Specifically, the experiments show that it is possible to obtain more discriminative features by combining lower-dimensional scale-space features into composed complex-cue histograms. Furthermore, different types of image descriptors have different relative advantages with respect to the problems of object instance recognition vs. object category classification. These conclusions are obtained from extensive evaluations on two mutually independent data sets. For the task of recognizing specific object instances, combined histograms of spatial and spatio-chromatic derivatives are highly discriminative, and several image descriptors in terms of rotationally invariant (intensity and spatio-chromatic) differential invariants up to order two lead to very high recognition rates. For category classification, primary information is contained in both first- and second-order derivatives, where second-order partial derivatives constitute the most discriminative cue.
Dimensionality reduction by principal component analysis and variance normalization prior to training and recognition can in many cases lead to a significant increase in recognition or classification performance. Surprisingly high recognition rates can even be obtained with binary histograms that reveal the polarity of local scale-space features, and which can be expected to be particularly robust to illumination variations. An overall conclusion from this study is that compared to previously used lower-dimensional histograms, the use of composed complex-cue histograms of higher dimensionality reveals the co-variation of multiple cues and enables much better recognition performance, both with regard to the problems of recognizing previously seen objects from novel views and for classifying previously unseen objects into visual categories.
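The preprocessing step just described (variance normalization followed by principal component analysis) can be sketched with a plain SVD-based projection. This is a generic illustration of the technique, not the paper's specific pipeline.

```python
# Sketch: variance-normalise histogram features, then project onto the
# top-k principal components before training a classifier.
import numpy as np

def pca_reduce(X, k):
    """X: (N, D) feature matrix. Returns the (N, k) projection onto
    the leading k principal components."""
    X = np.asarray(X, dtype=float)
    # mean-centre and variance-normalise each dimension
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # SVD of the centred data yields the principal directions in Vt
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T
```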
TL;DR: A generic framework in which images are modelled as order-less sets of weighted visual features, each visual feature is associated with a weight factor that may inform its relevance, and it is suggested that if dense sampling is used, different schemes to weight local features can be evaluated, leading to results that are often better than the combination of multiple sampling schemes.
Abstract: This paper presents a generic framework in which images are modelled as order-less sets of weighted visual features. Each visual feature is associated with a weight factor that may inform its relevance. This framework can be applied to various bag-of-features approaches such as the bag-of-visual-words or the Fisher kernel representations. We suggest that if dense sampling is used, different schemes to weight local features can be evaluated, leading to results that are often better than the combination of multiple sampling schemes, at a much lower computational cost, because the features are extracted only once. This allows our framework to serve as a test-bed for saliency estimation methods in image categorisation tasks. We explored two main possibilities for the estimation of local feature relevance. The first is based on the use of saliency maps obtained from human feedback, either by gaze tracking or by mouse clicks. The method is able to profit from such maps, leading to a significant improvement in categorisation performance. The second is based on automatic saliency estimation methods, including Itti & Koch's method and SIFT's DoG. We evaluated the proposed framework and saliency estimation methods using an in-house dataset and the PASCAL VOC 2008/2007 dataset, showing that some of the saliency estimation methods lead to a significant performance improvement in comparison to the standard unweighted representation.
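The weighted bag-of-features idea above amounts to letting each local feature vote for its visual word with a saliency weight instead of a constant 1. A minimal sketch, with a toy vocabulary and assumed weights:

```python
# Sketch: accumulate per-feature saliency weights into a visual-word
# histogram and L1-normalise it (unweighted BoW is the special case
# where every weight is 1).
import numpy as np

def weighted_bow(word_ids, weights, vocab_size):
    """word_ids: visual-word index of each local feature;
    weights: relevance weight of each feature."""
    h = np.zeros(vocab_size)
    # np.add.at handles repeated indices correctly
    np.add.at(h, np.asarray(word_ids), np.asarray(weights, dtype=float))
    return h / h.sum()
```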
TL;DR: An efficient solution to extract, in real-time, high-level information from an observed scene, and generate the most appropriate commands for a set of pan-tilt-zoom (PTZ) cameras in a surveillance scenario is presented.
Abstract: Cognitive visual tracking is the process of observing and understanding the behavior of a moving person. This paper presents an efficient solution to extract, in real-time, high-level information from an observed scene, and generate the most appropriate commands for a set of pan-tilt-zoom (PTZ) cameras in a surveillance scenario. Such a high-level feedback control loop, which is the main novelty of our work, will serve to reduce uncertainties in the observed scene and to maximize the amount of information extracted from it. It is implemented with a distributed camera system using SQL tables as virtual communication channels, and Situation Graph Trees for knowledge representation, inference and high-level camera control. A set of experiments in a surveillance scenario show the effectiveness of our approach and its potential for real applications of cognitive vision.
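The high-level feedback loop above selects PTZ commands so as to reduce uncertainty in the observed scene. A toy selection rule makes the idea concrete; the action names and the scoring by predicted uncertainty reduction are illustrative assumptions, not the paper's inference mechanism.

```python
# Toy sketch of the feedback loop's decision step: among candidate PTZ
# actions, pick the one predicted to reduce scene uncertainty the most.
def best_action(actions, uncertainty_reduction):
    """actions: list of command names; uncertainty_reduction: dict
    mapping commands to their predicted uncertainty reduction."""
    return max(actions, key=lambda a: uncertainty_reduction.get(a, 0.0))
```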