Showing papers in "Computer Vision and Image Understanding in 2010"
TL;DR: This work addresses the problem of incorporating different types of contextual information for robust object categorization in computer vision by considering the most common levels of extraction of context and the different levels of contextual interactions.
Abstract: The goal of object categorization is to locate and identify instances of an object category within an image. Recognizing an object in an image is difficult when images include occlusion, poor quality, noise or background clutter, and this task becomes even more challenging when many objects are present in the same scene. Several models for object categorization use appearance and context information from objects to improve recognition accuracy. Appearance information, based on visual cues, can successfully identify object classes up to a certain extent. Context information, based on the interaction among objects in the scene or global scene statistics, can help successfully disambiguate appearance inputs in recognition tasks. In this work we address the problem of incorporating different types of contextual information for robust object categorization in computer vision. We review different ways of using contextual information in the field of object categorization, considering the most common levels of extraction of context and the different levels of contextual interactions. We also examine common machine learning models that integrate context information into object recognition frameworks and discuss scalability, optimizations and possible future approaches.
383 citations
TL;DR: An overview of the TRECVid shot boundary detection task, a high-level overview of the most significant of the approaches taken, and a comparison of performances are presented, focussing on one year (2005) as an example.
Abstract: Shot boundary detection (SBD) is the process of automatically detecting the boundaries between shots in video. It is a problem which has attracted much attention since video became available in digital form as it is an essential pre-processing step to almost all video analysis, indexing, summarisation, search, and other content-based operations. Automatic SBD was one of the tracks of activity within the annual TRECVid benchmarking exercise, each year from 2001 to 2007 inclusive. Over those seven years we have seen 57 different research groups from across the world work to determine the best approaches to SBD while using a common dataset and common scoring metrics. In this paper we present an overview of the TRECVid shot boundary detection task, a high-level overview of the most significant of the approaches taken, and a comparison of performances, focussing on one year (2005) as an example.
319 citations
TL;DR: The segmented and annotated IAPR TC-12 benchmark is introduced: an extended resource for the evaluation of AIA methods as well as the analysis of their impact on multimedia information retrieval.
Abstract: Automatic image annotation (AIA), a highly popular topic in the field of information retrieval research, has experienced significant progress within the last decade. Yet, the lack of a standardized evaluation platform tailored to the needs of AIA has hindered effective evaluation of its methods, especially for region-based AIA. Therefore, in this paper we introduce the segmented and annotated IAPR TC-12 benchmark: an extended resource for the evaluation of AIA methods as well as the analysis of their impact on multimedia information retrieval. We describe the methodology adopted for the manual segmentation and annotation of images, and present statistics for the extended collection. The extended collection is publicly available and can be used to evaluate a variety of tasks in addition to image annotation. We also propose a soft measure for the evaluation of annotation performance and identify future research areas in which this extended test collection is likely to make a contribution.
315 citations
TL;DR: A new intensity-based calibration model is proposed that requires less input data than comparable models, significantly reducing the amount of calibration data and the number of necessary reference images.
Abstract: Over the past years Time-of-Flight (ToF) sensors have become a considerable alternative to conventional distance sensing techniques like laser scanners or image-based stereo vision. Due to the ability to provide full-range distance information at high frame-rates, ToF sensors have a significant impact on current research areas like online object recognition, collision prevention or scene and object reconstruction. Nevertheless, ToF cameras like the Photonic Mixer Device (PMD) still exhibit a number of error sources that affect the accuracy of measured distance information. For this reason, major error sources for ToF cameras are discussed, along with a new calibration approach that combines intrinsic, distance-related and reflectivity-related error calibration in an overall, easy-to-use system and thus significantly reduces the number of necessary reference images. The main contribution, in this context, is a new intensity-based calibration model that requires less input data than comparable models and thus significantly contributes to the reduction of calibration data.
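A minimal sketch of the kind of intensity-aware distance correction the abstract describes, assuming hypothetical calibration measurements and a simple polynomial-plus-intensity error model (the paper's actual model and parameters are not given here):

```python
import numpy as np

# Hypothetical calibration data: measured ToF distances (m), measured
# intensity (amplitude), and ground-truth distances from reference targets.
measured_d = np.array([1.02, 1.55, 2.11, 2.58, 3.07, 3.61, 4.12])
intensity  = np.array([0.90, 0.71, 0.55, 0.43, 0.33, 0.27, 0.21])
true_d     = np.array([1.00, 1.50, 2.00, 2.50, 3.00, 3.50, 4.00])

# Model the distance error as a low-order polynomial in measured distance
# plus a term depending on intensity -- a crude stand-in for the paper's
# combined distance/reflectivity calibration.
error = measured_d - true_d
A = np.column_stack([measured_d**2, measured_d, intensity,
                     np.ones_like(measured_d)])
coeffs, *_ = np.linalg.lstsq(A, error, rcond=None)

def correct(d, i):
    """Apply the fitted correction to a raw ToF measurement."""
    return d - (coeffs[0]*d**2 + coeffs[1]*d + coeffs[2]*i + coeffs[3])

print(correct(measured_d, intensity))   # should land roughly on true_d
```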
244 citations
TL;DR: This paper presents a real-time vision-based system to assist a person with dementia wash their hands, which combines a Bayesian sequential estimation framework for tracking hands and towel, with a decision-theoretic framework for computing policies of action.
Abstract: This paper presents a real-time vision-based system to assist a person with dementia wash their hands. The system uses only video inputs, and assistance is given as either verbal or visual prompts, or through the enlistment of a human caregiver's help. The system combines a Bayesian sequential estimation framework for tracking hands and towel, with a decision-theoretic framework for computing policies of action. The decision making system is a partially observable Markov decision process, or POMDP. Decision policies dictating system actions are computed in the POMDP using a point-based approximate solution technique. The tracking and decision making systems are coupled using a heuristic method for temporally segmenting the input video stream based on the continuity of the belief state. A key element of the system is the ability to estimate and adapt to user psychological states, such as awareness and responsiveness. We evaluate the system in three ways. First, we evaluate the hand-tracking system by comparing its outputs to manual annotations and to a simple hand-detection method. Second, we test the POMDP solution methods in simulation, and show that our policies have higher expected return than five other heuristic methods. Third, we report results from a ten-week trial with seven persons with moderate-to-severe dementia in a long-term care facility in Toronto, Canada. The subjects washed their hands once a day, with assistance given by our automated system, or by a human caregiver, in alternating two-week periods. We give two detailed case study analyses of the system working during trials, and then show agreement between the system and independent human raters of the same trials.
230 citations
TL;DR: A Census-based stereo matching algorithm that handles difficult areas for stereo matching, such as areas with low texture, very well in comparison to state-of-the-art real-time methods and can successfully eliminate false positives to provide reliable 3D data.
Abstract: In this paper, the challenge of fast stereo matching for embedded systems is tackled. Limited resources, e.g. memory and processing power, and most importantly real-time capability on embedded systems for robotic applications, do not permit the use of the most sophisticated stereo matching approaches. The strengths and weaknesses of different matching approaches have been analyzed and a well-suited solution has been found in a Census-based stereo matching algorithm. The novelty of the algorithm used is the explicit adaptation and optimization of the well-known Census transform with respect to embedded real-time systems in software. The most important change in comparison with the classic Census transform is the usage of a sparse Census mask, which halves the processing time with nearly unchanged matching quality. This is due to the fact that large sparse Census masks perform better than small dense masks with the same processing effort. Evidence for this assumption is given by the results of experiments with different mask sizes. Another contribution of this work is the presentation of a complete stereo matching system with its correlation-based core algorithm, the detailed analysis and evaluation of the results, and the optimized high-speed realization on different embedded and PC platforms. The algorithm handles areas that are difficult for stereo matching, such as areas with low texture, very well in comparison to state-of-the-art real-time methods. It can successfully eliminate false positives to provide reliable 3D data. The system is robust, easy to parameterize and offers high flexibility. It also achieves high performance on several, including resource-limited, systems without losing the good quality of stereo matching. A detailed performance analysis of the algorithm is given for optimized reference implementations on various commercial off-the-shelf (COTS) platforms, e.g. a PC, a DSP and a GPU, reaching a frame rate of up to 75 fps for 640x480 images and 50 disparities. The matching quality and processing time are compared to other algorithms on the Middlebury stereo evaluation website, reaching a mid-range quality rank and a top performance rank. Additional evaluation is done by comparing the results with a very fast and well-known sum of absolute differences algorithm using several Middlebury datasets and real-world scenarios.
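A rough sketch of the sparse-Census idea the abstract describes: compare each pixel against only a sparse subset of a large neighbourhood, then match by Hamming distance with winner-takes-all. The mask layout, window size and border handling below are illustrative assumptions; the paper's full system (correlation core, confidence checks) is not reproduced:

```python
import numpy as np

def sparse_census(img, offsets):
    """Census transform comparing the center pixel with a sparse subset
    of its neighbourhood (borders wrap -- fine for a sketch)."""
    bits = np.zeros(img.shape, dtype=np.uint64)
    for k, (dy, dx) in enumerate(offsets):
        neigh = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)
        bits |= (neigh < img).astype(np.uint64) << np.uint64(k)
    return bits

def match(left, right, max_disp, offsets):
    """Winner-takes-all matching on the Hamming distance of bit strings."""
    cl, cr = sparse_census(left, offsets), sparse_census(right, offsets)
    h, w = left.shape
    costs = np.full((max_disp, h, w), 64, dtype=np.int32)
    for d in range(max_disp):
        x = cl[:, d:] ^ cr[:, :w - d]
        # popcount: unpack each 64-bit string into bytes, then bits
        ham = np.unpackbits(x.view(np.uint8).reshape(h, -1, 8), axis=-1).sum(-1)
        costs[d, :, d:] = ham
    return costs.argmin(axis=0)

# Sparse mask: every other pixel of a 9x9 window (24 comparisons).
offsets = [(dy, dx) for dy in range(-4, 5, 2) for dx in range(-4, 5, 2)
           if (dy, dx) != (0, 0)]
```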
206 citations
TL;DR: This paper investigates automated detection and identification of malaria parasites in images of Giemsa-stained thin blood film specimens, proposing a complete framework to extract the stained structures, determine whether they are parasites, and identify the infecting species and life-cycle stages.
Abstract: This paper investigates automated detection and identification of malaria parasites in images of Giemsa-stained thin blood film specimens. The Giemsa stain highlights not only the malaria parasites but also the white blood cells, platelets, and artefacts. We propose a complete framework to extract these stained structures, determine whether they are parasites, and identify the infecting species and life-cycle stages. We investigate species and life-cycle-stage identification as multi-class classification problems in which we compare three different classification schemes and empirically show that the detection, species, and life-cycle-stage tasks can be performed in a joint classification as well as an extension to binary detection. The proposed binary parasite detector can operate at 0.1% parasitemia without any false detections and with less than 10 false detections at levels as low as 0.01%.
182 citations
TL;DR: This paper suggests a simple method to use multiple reference histograms for producing a single histogram that is more appropriate for tracking the target and proposes an extension to the Mean Shift tracker where the convex hull of these histograms is used as the target model.
Abstract: The Mean Shift tracker is a widely used tool for robustly and quickly tracking the location of an object in an image sequence using the object's color histogram. The reference histogram is typically set to that in the target region in the frame where the tracking is initiated. Often, however, no single view suffices to produce a reference histogram appropriate for tracking the target. In contexts where multiple views of the target are available prior to the tracking, this paper enhances the Mean Shift tracker to use multiple reference histograms obtained from these different target views. This is done while preserving both the convergence and the speed properties of the original tracker. We first suggest a simple method to use multiple reference histograms for producing a single histogram that is more appropriate for tracking the target. Then, to enhance the tracking further, we propose an extension to the Mean Shift tracker where the convex hull of these histograms is used as the target model. Many experimental results demonstrate the successful tracking of targets whose visible colors change drastically and rapidly during the sequence, where the basic Mean Shift tracker obviously fails.
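The following sketch illustrates one plausible way to combine multiple reference histograms into a single tracking model, using Bhattacharyya-coefficient weighting; it is a hedged stand-in, not the paper's combination method or its convex-hull extension:

```python
import numpy as np

def bhattacharyya(p, q):
    return np.sum(np.sqrt(p * q))

def combined_reference(candidate, refs):
    """Convex combination of the reference histograms, weighted by their
    similarity to the current candidate, so the model follows whichever
    stored view the target currently resembles."""
    w = np.array([bhattacharyya(candidate, q) for q in refs])
    w /= w.sum()
    model = sum(wi * qi for wi, qi in zip(w, refs))
    return model / model.sum()

# Hypothetical 16-bin color histograms of a target seen from three views.
rng = np.random.default_rng(0)
refs = [rng.dirichlet(np.ones(16)) for _ in range(3)]
candidate = 0.7 * refs[0] + 0.3 * refs[2]   # current view mixes two models
model = combined_reference(candidate, refs)
print(bhattacharyya(candidate, model),
      max(bhattacharyya(candidate, q) for q in refs))
```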
130 citations
TL;DR: An integrated method for post-processing of range data which removes outliers, smoothes the depth values and enhances the lateral resolution in order to achieve visually pleasing 3D models from low-cost depth sensors with additional (registered) color images is presented.
Abstract: We present an integrated method for post-processing of range data which removes outliers, smoothes the depth values and enhances the lateral resolution in order to achieve visually pleasing 3D models from low-cost depth sensors with additional (registered) color images. The algorithm is based on the non-local principle and adapts the original NL-Means formulation to the characteristics of typical depth data. Explicitly handling outliers in the sensor data, our denoising approach achieves unbiased reconstructions from error-prone input data. Taking intra-patch similarity into account, we reconstruct strong discontinuities without disturbing artifacts and preserve fine detail structures, obtaining piece-wise smooth depth maps. Furthermore, we exploit the dependencies of the depth data with additionally available color information and increase the lateral resolution of the depth maps. We finally discuss how to parallelize the algorithm in order to achieve fast processing times that are adequate for post-processing of data from fast depth sensors such as time-of-flight cameras.
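A compact sketch of NL-Means adapted to depth maps with explicit outlier masking, in the spirit of the abstract; patch and search sizes and the validity mask are illustrative assumptions, and the paper's intra-patch similarity term and color-guided upsampling are omitted:

```python
import numpy as np

def nlmeans_depth(depth, valid, patch=3, search=7, h=0.05):
    """Minimal non-local means on a depth map. 'valid' masks outliers
    (e.g. flying pixels) so they contribute neither to patch distances
    nor to the weighted average."""
    r, s = patch // 2, search // 2
    pad_d = np.pad(depth, r + s, mode='edge')
    pad_v = np.pad(valid.astype(float), r + s, mode='edge')
    out = depth.copy()
    H, W = depth.shape
    for y in range(H):
        for x in range(W):
            yc, xc = y + r + s, x + r + s
            ref = pad_d[yc-r:yc+r+1, xc-r:xc+r+1]
            refv = pad_v[yc-r:yc+r+1, xc-r:xc+r+1]
            num = den = 0.0
            for dy in range(-s, s + 1):
                for dx in range(-s, s + 1):
                    cand = pad_d[yc+dy-r:yc+dy+r+1, xc+dx-r:xc+dx+r+1]
                    cv = pad_v[yc+dy-r:yc+dy+r+1, xc+dx-r:xc+dx+r+1] * refv
                    if cv.sum() == 0 or not pad_v[yc+dy, xc+dx]:
                        continue   # candidate is an outlier: skip it
                    d2 = ((ref - cand)**2 * cv).sum() / cv.sum()
                    w = np.exp(-d2 / (h * h))
                    num += w * pad_d[yc+dy, xc+dx]
                    den += w
            if den > 0:
                out[y, x] = num / den
    return out

rng = np.random.default_rng(10)
depth = rng.random((20, 20)); valid = rng.random((20, 20)) > 0.05
smoothed = nlmeans_depth(depth, valid)
```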
126 citations
TL;DR: The objectives of this work are to propose pre-processing methods and improvements in support vector machines to increase the accuracy achieved while the number of support vectors, and thus the number of operations needed in the test phase, is reduced.
Abstract: Pattern recognition methods are used in the final stage of a traffic sign detection and recognition system, where the main objective is to categorize a detected sign. Support vector machines have been reported as a good method to achieve this main target due to their ability to provide good accuracy as well as being sparse methods. Nevertheless, for complete data sets of traffic signs the number of operations needed in the test phase is still large, whereas the accuracy needs to be improved. The objectives of this work are to propose pre-processing methods and improvements in support vector machines to increase the accuracy achieved while the number of support vectors, and thus the number of operations needed in the test phase, is reduced. Results show that with the proposed methods the accuracy is increased by 3-5% with a reduction in the number of support vectors of 50-70%.
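To make the accuracy-versus-support-vector trade-off concrete, here is a small experiment on a stand-in dataset (digits rather than traffic signs); the paper's pre-processing and SVM modifications are not reproduced:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: handwritten digits instead of traffic sign crops.
X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(Xtr)
Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)

# The quantity the paper trades off: test accuracy vs. support-vector
# count (which sets the number of kernel evaluations at test time).
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel='rbf', C=C, gamma='scale').fit(Xtr, ytr)
    print(f"C={C:>4}: acc={clf.score(Xte, yte):.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```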
124 citations
TL;DR: The development of an automatic breast tissue classification methodology is described, which can be summarized in a number of distinct steps: (1) preprocessing, (2) feature extraction, and (3) classification.
Abstract: Mammographic density is known to be an important indicator of breast cancer risk. Classification of mammographic density based on statistical features has been investigated previously. However, in those approaches the entire breast, including the pectoral muscle, has been processed to extract features. In this approach the region of interest is restricted to the breast tissue alone, eliminating the artifacts, background and the pectoral muscle. The mammogram images used in this study are from the Mini-MIAS digital database. Here, we describe the development of an automatic breast tissue classification methodology, which can be summarized in a number of distinct steps: (1) preprocessing, (2) feature extraction, and (3) classification. Gray level thresholding and connected component labeling are used to eliminate the artifacts and pectoral muscles from the region of interest. Statistical features that capture the important texture properties of breast tissue are extracted from this region. These features are fed to a support vector machine (SVM) classifier to classify the tissue into one of three classes: fatty, glandular or dense. The classifier accuracy obtained is 95.44%.
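A skeletal version of the described pipeline (thresholding, connected-component labeling, first-order texture statistics, SVM), using synthetic images and simplified features as stand-ins for the Mini-MIAS data and the paper's feature set:

```python
import numpy as np
from skimage import measure
from sklearn.svm import SVC

def breast_region(img, thresh=0.1):
    """Keep the largest connected component above a gray-level threshold --
    a rough stand-in for the paper's artifact/pectoral-muscle removal."""
    lbl = measure.label(img > thresh)
    sizes = np.bincount(lbl.ravel())
    sizes[0] = 0                       # ignore the background label
    return lbl == sizes.argmax()

def texture_features(img, mask):
    """Simple first-order statistics (mean, spread, skewness) of the
    tissue region; the paper's feature set is richer."""
    v = img[mask]
    return [v.mean(), v.std(),
            ((v - v.mean())**3).mean() / (v.std()**3 + 1e-12)]

# Hypothetical usage: feature vectors from many images -> 3-class SVM
# (fatty / glandular / dense), mirroring the pipeline's last step.
rng = np.random.default_rng(1)
imgs = rng.random((30, 64, 64)); labels = rng.integers(0, 3, 30)
X = [texture_features(im, breast_region(im)) for im in imgs]
clf = SVC(kernel='rbf').fit(X, labels)
```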
TL;DR: The aim of the paper is to show the advantages of using an efficient model of the processing occurring at the retina level and in the V1 visual cortex in order to develop efficient and fast bio-inspired modules for low level image processing.
Abstract: An efficient model of the processing occurring at the retina level and in the V1 visual cortex has been proposed in [1,2]. The aim of the paper is to show the advantages of using such a model to develop efficient and fast bio-inspired modules for low level image processing. At the retina level, a spatio-temporal filtering ensures accurate structuring of video data (noise and illumination variation removal, static and dynamic contour enhancement). In the V1 cortex, a frequency and orientation based analysis is performed. The combined use of retina and V1 cortex modeling allows the development of low level image processing modules for contour enhancement, for moving contour extraction, for motion analysis and for motion event detection. Each module is described and its performances are evaluated. The retina model has been integrated into a real-time C/C++ optimized program which is also presented in this paper with the derived computer vision tools.
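Since the abstract describes the retina stage concretely (spatio-temporal filtering, static and dynamic contour enhancement), here is a loose sketch of that stage built from a difference-of-Gaussians center-surround and a temporal low-pass; all constants and the exact form are our illustrative assumptions, not the model of [1,2]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def retina_like_filter(frame, prev_state, sigma_c=1.0, sigma_s=3.0, tau=0.7):
    """Spatial center-surround (DoG) plus a temporal low-pass, loosely
    mimicking the retina's static and dynamic channels."""
    state = tau * prev_state + (1 - tau) * frame           # temporal low-pass
    dog = gaussian_filter(state, sigma_c) - gaussian_filter(state, sigma_s)
    transient = frame - state                              # "moving contour" cue
    return dog, transient, state

rng = np.random.default_rng(8)
state = np.zeros((64, 64))
for _ in range(5):                                          # toy video loop
    contours, motion, state = retina_like_filter(rng.random((64, 64)), state)
```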
TL;DR: A novel stereo matching algorithm is proposed that is designed for high efficiency when realized in hardware and targeted at deployment in Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs).
Abstract: To enable both accurate and fast real-time stereo vision in embedded systems, we propose a novel stereo matching algorithm that is designed for high efficiency when realized in hardware. We evaluate its accuracy using the Middlebury Stereo Evaluation, revealing its high performance at minimum tolerance. To outline the resource efficiency of the algorithm, we present its realization as an Intellectual Property (IP) core that is designed for the deployment in Field Programmable Gate Arrays (FPGAs) and Application Specific Integrated Circuits (ASICs).
TL;DR: This paper presents a method for estimating six-degrees-of-freedom camera motions from central catadioptric images in man-made environments by decoupling the rotation and the translation, and shows that the line-based approach allows the absolute attitude to be estimated at each frame without error accumulation.
Abstract: Previous works have shown that catadioptric systems are particularly suited for egomotion estimation thanks to their large field of view, and numerous algorithms have already been proposed in the literature to estimate the motion. In this paper, we present a method for estimating six-degrees-of-freedom camera motions from central catadioptric images in man-made environments. State-of-the-art methods can obtain very impressive results. However, our proposed system provides two strong advantages over the existing methods: first, it can implicitly handle the difficulty of planar/non-planar scenes, and second, it is computationally much less expensive. The only assumption is the presence of parallel straight lines, which is reasonable in a man-made environment. More precisely, we estimate the motion by decoupling the rotation and the translation. The rotation is computed by an efficient algorithm based on the detection of dominant bundles of parallel catadioptric lines, and the translation is calculated from a robust 2-point algorithm. We also show that the line-based approach allows the absolute attitude (roll and pitch angles) to be estimated at each frame, without error accumulation. The efficiency of our approach has been validated by experiments in both indoor and outdoor environments and also by comparison with other existing methods.
TL;DR: This work presents an approximate solution to the problem of visually finding an object in a mostly unknown space with a mobile robot and investigates its performance and properties to conclude that this approach is sufficient to solve this problem and has additional desirable empirical characteristics.
Abstract: Consider the problem of visually finding an object in a mostly unknown space with a mobile robot. It is clear that all possible views and images cannot be examined in a practical system. Visual attention is a complex phenomenon; we view it as a mechanism that optimizes the search processes inherent in vision (Tsotsos, 2001; Tsotsos et al., 2008) [1,2]. Here, we describe a particular example of a practical robotic vision system that employs some of these attentive processes. We cast this as an optimization problem, i.e., optimizing the probability of finding the target given a fixed cost limit in terms of total number of robotic actions required to find the visual target. Due to the inherent intractability of this problem, we present an approximate solution and investigate its performance and properties. We conclude that our approach is sufficient to solve this problem and has additional desirable empirical characteristics.
TL;DR: An active perception system, consisting of a camera mounted on a pan-tilt unit and a 360° RFID detection system, both embedded on a mobile robot, is considered, and a multi-sensor-based control strategy based on the tracker outputs and on the RFID data is designed.
Abstract: In this paper, we address the problem of realizing a human following task in a crowded environment. We consider an active perception system, consisting of a camera mounted on a pan-tilt unit and a 360° RFID detection system, both embedded on a mobile robot. To perform such a task, it is necessary to efficiently track humans in crowds. In a first step, we have dealt with this problem using the particle filtering framework because it enables the fusion of heterogeneous data, which improves the tracking robustness. In a second step, we have considered the problem of controlling the robot motion to make the robot follow the person of interest. To this end, we have designed a multi-sensor-based control strategy based on the tracker outputs and on the RFID data. Finally, we have implemented the tracker and the control strategy on our robot. The obtained experimental results highlight the relevance of the developed perceptual functions. Possible extensions of this work are discussed at the end of the article.
TL;DR: The proposed approach outperforms existing works such as the scale invariant feature transform (SIFT) or the speeded-up robust features (SURF), and is robust to some changes in illumination, viewpoint, color distribution, image quality, and object deformation.
Abstract: Most multi-camera systems assume a well structured environment to detect and track objects across cameras. Cameras need to be fixed and calibrated, or only objects within a training data set can be detected (e.g. pedestrians only). In this work, a master-slave system is presented to detect and track any objects in a network of uncalibrated fixed and mobile cameras. Cameras can have non-overlapping fields of view. Objects are detected with the mobile cameras (the slaves) given only observations from the fixed cameras (the masters). No training stage or data are used. Detected objects are correctly tracked across cameras, leading to a better understanding of the scene. A cascade of grids of region descriptors is proposed to describe any object of interest. To lend insight on the addressed problem, most state-of-the-art region descriptors are evaluated given various schemes. The covariance matrix of various features, the histogram of colors, the histogram of oriented gradients, the scale invariant feature transform (SIFT), the speeded-up robust features (SURF) descriptors, and the color interest points [1] are evaluated. A sparse scan of the cameras' image plane is also presented to reduce the search space of the localization process, approaching nearly real-time performance. The proposed approach outperforms existing works such as the scale invariant feature transform (SIFT) or the speeded-up robust features (SURF). The approach is robust to some changes in illumination, viewpoint, color distribution, image quality, and object deformation. Objects with partial occlusion are also detected and tracked.
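As an illustration of one of the evaluated region descriptors, here is a sketch of the covariance descriptor with a log-Euclidean distance; the feature set (position, intensity, gradient magnitudes) and the metric are common choices and may differ from the paper's:

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(patch):
    """Covariance of per-pixel features (x, y, intensity, |Ix|, |Iy|)
    over one grid region; the exact feature set is our assumption."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    iy, ix = np.gradient(patch.astype(float))
    F = np.stack([xs.ravel(), ys.ravel(), patch.ravel().astype(float),
                  np.abs(ix).ravel(), np.abs(iy).ravel()])
    return np.cov(F)

def covariance_distance(c1, c2, eps=1e-6):
    """Log-Euclidean distance between two covariance descriptors."""
    d = c1.shape[0]
    l1 = logm(c1 + eps * np.eye(d))   # jitter keeps the matrices SPD
    l2 = logm(c2 + eps * np.eye(d))
    return np.linalg.norm((l1 - l2).real, 'fro')

rng = np.random.default_rng(5)
dist = covariance_distance(covariance_descriptor(rng.random((16, 16))),
                           covariance_descriptor(rng.random((16, 16))))
```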
TL;DR: The results demonstrate that the CA is capable of being trained to perform many different tasks, and that the quality of these results is in many cases comparable to or better than that of established specialised algorithms.
Abstract: This paper describes the application of cellular automata (CA) to various image processing tasks such as denoising and feature detection. Whereas our previous work mainly dealt with binary images, the current work operates on intensity images. The increased number of cell states (i.e. pixel intensities) leads to a vast increase in the number of possible rules. Therefore, a reduced intensity representation is used, leading to a three-state CA that is more practical. In addition, a modified sequential floating forward search mechanism is developed in order to speed up the selection of good rule sets in the CA training stage. Results are compared with our previous method based on threshold decomposition, and are found to be generally superior. The results demonstrate that the CA is capable of being trained to perform many different tasks, and that the quality of these results is in many cases comparable to or better than that of established specialised algorithms.
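A toy three-state CA in the spirit of the abstract: intensities are reduced to three states and each cell is updated via a rule table over its von Neumann neighbourhood. The random rule below only illustrates the mechanics; the paper learns its rule sets with a floating forward search:

```python
import numpy as np

def reduce_states(img, levels=3):
    """Quantize 8-bit intensities into a small number of cell states,
    as the paper does to keep the rule space manageable."""
    return np.minimum((img.astype(int) * levels) // 256, levels - 1)

def ca_step(cells, rule):
    """One synchronous update: each cell's next state is looked up from
    its own state and its 4-neighbours' states (borders wrap)."""
    n = np.roll(cells, 1, 0); s = np.roll(cells, -1, 0)
    w = np.roll(cells, 1, 1); e = np.roll(cells, -1, 1)
    key = (((cells * 3 + n) * 3 + s) * 3 + w) * 3 + e   # base-3 index
    return rule[key]

rng = np.random.default_rng(2)
cells = reduce_states(rng.integers(0, 256, (32, 32)))
rule = rng.integers(0, 3, 3**5)   # random rule table: 3 states, 5 cells
cells = ca_step(cells, rule)
```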
TL;DR: Experimental results show that the proposed fuzzy color histogram-based shot-boundary detection algorithm effectively detects shot boundaries and reduces false alarms as compared to the state-of-the-art shot- boundary detection algorithms.
Abstract: We present a fuzzy color histogram-based shot-boundary detection algorithm specialized for content-based copy detection applications. The proposed method aims to detect both cuts and gradual transitions (fade, dissolve) effectively in videos where heavy transformations (such as cam-cording, insertions of patterns, strong re-encoding) occur. Along with the color histogram generated with the fuzzy linking method on L*a*b* color space, the system extracts a mask for still regions and the window of picture-in-picture transformation for each detected shot, which will be useful in a content-based copy detection system. Experimental results show that our method effectively detects shot boundaries and reduces false alarms as compared to the state-of-the-art shot-boundary detection algorithms.
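A sketch of a fuzzy (soft-assignment) L*a*b* color histogram and a simple cut test, assuming Gaussian memberships to hypothetical prototype colors; the paper's actual fuzzy-linking rules and gradual-transition handling are not reproduced:

```python
import numpy as np
import cv2

def fuzzy_lab_histogram(bgr, centers, sigma=25.0):
    """Soft color histogram: each L*a*b* pixel spreads Gaussian membership
    over all prototype colors, so small photometric changes move mass
    between bins smoothly instead of abruptly."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).reshape(-1, 3).astype(float)
    d2 = ((lab[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    memb = np.exp(-d2 / (2 * sigma ** 2))
    memb /= memb.sum(axis=1, keepdims=True) + 1e-12
    hist = memb.sum(axis=0)
    return hist / hist.sum()

def is_cut(prev_hist, cur_hist, thresh=0.2):
    """Declare a shot boundary when the L1 histogram distance jumps."""
    return 0.5 * np.abs(prev_hist - cur_hist).sum() > thresh

# Hypothetical prototypes: a coarse 3x3x3 grid in OpenCV's 8-bit Lab range.
g = np.linspace(30, 220, 3)
centers = np.array([[l, a, b] for l in g for a in g for b in g])
frame = np.random.default_rng(9).integers(0, 256, (24, 32, 3), dtype=np.uint8)
h = fuzzy_lab_histogram(frame, centers)
```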
TL;DR: This is the first study to document degraded iris biometrics performance with non-cosmetic contact lenses.
Abstract: Many iris recognition systems operate under the assumption that non-cosmetic contact lenses have no or minimal effect on iris biometrics performance and convenience. In this paper we show results of a study of 12,003 images from 87 contact-lens-wearing subjects and 9697 images from 124 non-contact-lens-wearing subjects. We visually classified the contact lens images into four categories according to the type of lens effects observed in the image. Our results show different degradations in performance for different types of contact lenses. Lenses that produce larger artifacts on the iris yield more degraded performance. This is the first study to document degraded iris biometrics performance with non-cosmetic contact lenses.
TL;DR: This paper compares four approaches to achieve a compact codebook vocabulary while retaining categorization performance and investigates the trade-off between codebook compactness and categorization performance.
Abstract: In the face of current large-scale video libraries, the practical applicability of content-based indexing algorithms is constrained by their efficiency. This paper strives for efficient large-scale video indexing by comparing various visual-based concept categorization techniques. In visual categorization, the popular codebook model has shown excellent categorization performance. The codebook model represents continuous visual features by discrete prototypes predefined in a vocabulary. The vocabulary size has a major impact on categorization efficiency, where a more compact vocabulary is more efficient. However, smaller vocabularies typically score lower on classification performance than larger vocabularies. This paper compares four approaches to achieve a compact codebook vocabulary while retaining categorization performance. For these four methods, we investigate the trade-off between codebook compactness and categorization performance. We evaluate the methods on more than 200 hours of challenging video data with as many as 101 semantic concepts. The results allow us to create a taxonomy of the four methods based on their efficiency and categorization performance.
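For reference, a minimal codebook model: cluster local descriptors into a vocabulary and histogram the assignments; the vocabulary size (64 below, an arbitrary choice) is the compactness knob the paper studies:

```python
import numpy as np
from sklearn.cluster import KMeans

# Minimal codebook model: the vocabulary is a set of prototype vectors,
# and a frame/shot is represented by its histogram of assigned prototypes.
rng = np.random.default_rng(3)
descriptors = rng.random((5000, 128))        # stand-in for e.g. SIFT features
vocab = KMeans(n_clusters=64, n_init=4, random_state=0).fit(descriptors)

def bow_histogram(frame_descriptors, vocab):
    """Quantize descriptors against the vocabulary and normalize counts."""
    words = vocab.predict(frame_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

h = bow_histogram(rng.random((300, 128)), vocab)
```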
TL;DR: A three-module system based on both 2D and 3D cues, using Real AdaBoost, Haar wavelets and edge orientation histograms, gives rise to a promising system for detecting pedestrians in urban scenarios.
Abstract: During the next decade, on-board pedestrian detection systems will play a key role in the challenge of increasing traffic safety. The main target of these systems, to detect pedestrians in urban scenarios, implies overcoming difficulties like processing outdoor scenes from a mobile platform and searching for aspect-changing objects in cluttered environments. Such systems must therefore combine state-of-the-art computer vision techniques. In this paper we present a three-module system based on both 2D and 3D cues. The first module uses 3D information to estimate the road plane parameters and thus select a coherent set of regions of interest (ROIs) to be further analyzed. The second module uses Real AdaBoost and a combined set of Haar wavelets and edge orientation histograms to classify the incoming ROIs as pedestrian or non-pedestrian. The final module loops back to the 3D cue to verify the classified ROIs and to the 2D cue to refine the final results. According to the results, the integration of the proposed techniques gives rise to a promising system.
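A simplified sketch of the second module's classifier, pairing edge-orientation-histogram (HOG) features with boosting; scikit-learn's AdaBoost stands in for Real AdaBoost, the Haar-wavelet features are omitted, and the data below are synthetic placeholders:

```python
import numpy as np
from skimage.feature import hog
from sklearn.ensemble import AdaBoostClassifier

def roi_features(roi):
    """Edge-orientation-histogram features for one region of interest."""
    return hog(roi, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Hypothetical training data: 128x64 grayscale ROI crops labeled
# pedestrian (1) / background (0).
rng = np.random.default_rng(4)
rois = rng.random((40, 128, 64)); labels = rng.integers(0, 2, 40)
X = np.array([roi_features(r) for r in rois])
clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)
```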
TL;DR: It is demonstrated that concept detection in web video is feasible, and that - when testing on YouTube videos - the YouTube-based detector outperforms the ones trained on standard training sets.
Abstract: Concept detection is targeted at automatically labeling video content with semantic concepts appearing in it, like objects, locations, or activities. While concept detectors have become key components in many research prototypes for content-based video retrieval, their practical use is limited by the need for large-scale annotated training sets. To overcome this problem, we propose to train concept detectors on material downloaded from web-based video sharing portals like YouTube, such that training is based on tags given by users during upload, no manual annotation is required, and concept detection can scale up to thousands of concepts. On the downside, web video as training material is a complex domain, and the tags associated with it are weak and unreliable. Consequently, performance loss is to be expected when replacing high-quality state-of-the-art training sets with web video content. This paper presents a concept detection prototype named TubeTagger that utilizes YouTube content for autonomous training. In quantitative experiments, we compare the performance when training on web video and on standard datasets from the literature. It is demonstrated that concept detection in web video is feasible, and that - when testing on YouTube videos - the YouTube-based detector outperforms the ones trained on standard training sets. By applying the YouTube-based prototype to datasets from the literature, we further demonstrate that: (1) If training annotations on the target domain are available, the resulting detectors significantly outperform the YouTube-based tagger. (2) If no annotations are available, the YouTube-based detector achieves comparable performance to the ones trained on standard datasets (a moderate relative performance loss of 11.4% is measured) while offering the advantage of fully automatic, scalable learning. (3) By enriching conventional training sets with online video material, performance improvements of 11.7% can be achieved when generalizing to domains unseen in training.
TL;DR: A novel probabilistic approach to data association is presented that takes into account that features can also move between cameras under robot motion, and that circumvents the combinatorial data association problem by using an incremental expectation maximization algorithm.
Abstract: We propose to use a multi-camera rig for simultaneous localization and mapping (SLAM), providing flexibility in sensor placement on mobile robot platforms while exploiting the stronger localization constraints provided by omni-directional sensors. In this context, we present a novel probabilistic approach to data association, that takes into account that features can also move between cameras under robot motion. Our approach circumvents the combinatorial data association problem by using an incremental expectation maximization algorithm. In the expectation step we determine a distribution over correspondences by sampling. In the maximization step, we find optimal parameters of a density over the robot motion and environment structure. By summarizing the sampling results in so-called virtual measurements, the resulting optimization simplifies to the equivalent optimization problem for known correspondences. We present results for simulated data, as well as for data obtained by a mobile robot equipped with a multi-camera rig.
TL;DR: This paper presents an approach for view-invariant gesture recognition based on 3D data captured by a SwissRanger SR4000 camera; the recognition rate is similar to that obtained when training and testing on gestures from the same viewpoint, hence the approach is indeed view-invariant.
Abstract: This paper presents an approach for view-invariant gesture recognition. The approach is based on 3D data captured by a SwissRanger SR4000 camera. This camera produces both a depth map as well as an intensity image of a scene. Since the two information types are aligned, we can use the intensity image to define a region of interest for the relevant 3D data. This data fusion improves the quality of the motion detection and hence results in better recognition. The gesture recognition is based on finding motion primitives (temporal instances) in the 3D data. Motion is detected by a 3D version of optical flow and results in velocity annotated point clouds. The 3D motion primitives are represented efficiently by introducing motion context. The motion context is transformed into a view-invariant representation using spherical harmonic basis functions, yielding a harmonic motion context representation. A probabilistic Edit Distance classifier is applied to identify which gesture best describes a string of primitives. The approach is trained on data from one viewpoint and tested on data from a very different viewpoint. The recognition rate is 94.4% which is similar to the recognition rate when training and testing on gestures from the same viewpoint, hence the approach is indeed view-invariant.
TL;DR: The proposed method has been shown to perform very well with both noisy and noise-free images from multimodal datasets, outperforming conventional methods in terms of fusion quality and noise reduction in the fused output.
Abstract: This paper describes a new methodology for multimodal image fusion based on non-Gaussian statistical modelling of wavelet coefficients. Special emphasis is placed on the fusion of noisy images. The use of families of generalised Gaussian and alpha-stable distributions for modelling image wavelet coefficients is investigated and methods for estimating distribution parameters are proposed. Improved techniques for image fusion are developed, by incorporating these models into a weighted average image fusion algorithm. The proposed method has been shown to perform very well with both noisy and noise-free images from multimodal datasets, outperforming conventional methods in terms of fusion quality and noise reduction in the fused output.
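A bare-bones wavelet fusion sketch in which local coefficient energy sets the weights; in the paper the weights come instead from generalized-Gaussian / alpha-stable models of the coefficients:

```python
import numpy as np
import pywt

def fuse_wavelet(img_a, img_b, wavelet='db2', level=3):
    """Weighted-average fusion of wavelet coefficients. The energy-based
    weight below is a plain stand-in for the paper's statistically
    modelled weights."""
    ca = pywt.wavedec2(img_a, wavelet, level=level)
    cb = pywt.wavedec2(img_b, wavelet, level=level)
    fused = [0.5 * (ca[0] + cb[0])]              # average the approximation
    for da, db in zip(ca[1:], cb[1:]):           # detail bands per level
        bands = []
        for a, b in zip(da, db):
            wa = a**2 / (a**2 + b**2 + 1e-12)    # energy-based weight
            bands.append(wa * a + (1 - wa) * b)
        fused.append(tuple(bands))
    return pywt.waverec2(fused, wavelet)

rng = np.random.default_rng(6)
fused = fuse_wavelet(rng.random((64, 64)), rng.random((64, 64)))
```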
TL;DR: A modified four-source PS algorithm is presented which enhances the surface normal estimates by assigning a likelihood measure for each pixel being in a shadowed region, determined by the discrepancies between measured pixel brightnesses and expected values.
Abstract: This paper seeks to advance the state-of-the-art in 3D face capture and processing via novel Photometric Stereo (PS) hardware and algorithms. The first contribution is a new high-speed 3D data capture system, which is capable of acquiring four raw images in approximately 20 ms. The results presented in this paper demonstrate the feasibility of deploying the device in commercial settings. We show how the device can operate with either visible light or near infrared (NIR) light. The NIR light sources offer the advantages of being less intrusive and more covert than most existing face recognition methods allow. Furthermore, our experiments show that the accuracy of the reconstructions is also better using NIR light. The paper also presents a modified four-source PS algorithm which enhances the surface normal estimates by assigning a likelihood measure for each pixel being in a shadowed region. This likelihood measure is determined by the discrepancies between measured pixel brightnesses and expected values. Where the likelihood of shadow is high, then one light source is omitted from the computation for that pixel, otherwise a weighted combination of pixels is used to determine the surface normal. This means that the precise shadow boundary is not required by our method. The results section of the paper provides a detailed analysis of the methods presented and a comparison to ground truth. We also analyse the reflectance properties of a small number of skin samples to test the validity of the Lambertian model and point towards potential improvements to our method using the Oren-Nayar model.
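A sketch of four-source photometric stereo with residual-based down-weighting of likely shadowed measurements, echoing (but not reproducing) the paper's shadow-likelihood scheme; the light directions and the Gaussian weight are arbitrary assumptions:

```python
import numpy as np

def photometric_stereo(I, L, shadow_sigma=0.1):
    """Surface normals from four images under known lights L (4x3).
    Measurements that disagree with the Lambertian prediction are
    down-weighted as likely shadows (our stand-in likelihood)."""
    npix = I.shape[1]
    G = np.linalg.lstsq(L, I, rcond=None)[0]      # initial 3 x npix estimate
    resid = I - L @ G
    w = np.exp(-(resid / shadow_sigma)**2)        # low weight = likely shadow
    N = np.empty((3, npix))
    for p in range(npix):                          # weighted re-solve per pixel
        Wl = w[:, p][:, None] * L
        N[:, p] = np.linalg.lstsq(Wl, w[:, p] * I[:, p], rcond=None)[0]
    return N / (np.linalg.norm(N, axis=0) + 1e-12)

# Synthetic check: render a known normal under four lights and recover it.
L = np.array([[0, 0.5, 1], [0.5, 0, 1], [0, -0.5, 1], [-0.5, 0, 1]], float)
L /= np.linalg.norm(L, axis=1, keepdims=True)
true_n = np.array([[0.0], [0.0], [1.0]])
I = L @ true_n                                     # 4 x 1 Lambertian intensities
print(photometric_stereo(I, L)[:, 0])              # ~ [0, 0, 1]
```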
TL;DR: Experimental results demonstrate that the developed tracker is capable of handling several challenging situations, where the labels of objects are correctly identified and maintained over time, despite the complex interactions among the tracked objects that lead to several layers of occlusions.
Abstract: We present a robust object tracking algorithm that handles spatially extended and temporally long object occlusions. The proposed approach is based on the concept of "object permanence" which suggests that a totally occluded object will re-emerge near its occluder. The proposed method does not require prior training to account for differences in the shape, size, color or motion of the objects to be tracked. Instead, the method automatically and dynamically builds appropriate object representations that enable robust and effective tracking and occlusion reasoning. The proposed approach has been evaluated on several image sequences showing either complex object manipulation tasks or human activity in the context of surveillance applications. Experimental results demonstrate that the developed tracker is capable of handling several challenging situations, where the labels of objects are correctly identified and maintained over time, despite the complex interactions among the tracked objects that lead to several layers of occlusions.
TL;DR: Under the general assumption of stationary foreground appearance, it is shown that robust object tracking is possible by modeling the changing foreground shape as a small number of rectangular blocks and adaptively adjusting their positions within the tracking window.
Abstract: We propose an algorithm for accurate tracking of articulated objects using online update of appearance and shape. The challenge here is to model foreground appearance with histograms in a way that is both efficient and accurate. In this algorithm, the constantly changing foreground shape is modeled as a small number of rectangular blocks, whose positions within the tracking window are adaptively determined. Under the general assumption of stationary foreground appearance, we show that robust object tracking is possible by adaptively adjusting the locations of these blocks. Implemented in MATLAB without substantial optimization, our tracker already runs at 3.7 frames per second on a 3GHz machine. Experimental results have demonstrated that the algorithm is able to efficiently track articulated objects undergoing large variation in appearance and shape.
TL;DR: This paper presents a light-weight and efficient background modeling and foreground detection algorithm that is highly robust against lighting variations and non-static backgrounds including scenes with swaying trees, water fountains and rain.
Abstract: An embedded smart camera is a stand-alone unit that not only captures images, but also includes a processor, memory and communication interface. Battery-powered, embedded smart cameras introduce many additional challenges since they have very limited resources, such as energy, processing power and memory. Computer vision algorithms running on these camera boards should be light-weight and efficient. Considering the memory requirements of an algorithm and its portability to an embedded processor should be an integral part of the algorithm design in addition to the accuracy requirements. This paper presents a light-weight and efficient background modeling and foreground detection algorithm that is highly robust against lighting variations and non-static backgrounds including scenes with swaying trees, water fountains and rain. Compared to many traditional methods, the memory required for the data saved for each pixel is very small in the proposed algorithm. Moreover, the numbers of memory accesses and instructions are adaptive and decrease with the amount of activity in the scene. Each pixel is treated differently based on its history, and instead of requiring the same number of memory accesses and instructions for every pixel, fewer instructions are required for stable background pixels. The plot of the number of unstable pixels at each frame also serves as a tool to find the video portions with high activity. The proposed method selectively updates the background model with an automatically adaptive rate, and thus can adapt to rapid changes. As opposed to traditional methods, pixels are not always treated individually, and information about neighbors is incorporated into decision making. The results obtained with nine challenging outdoor and indoor sequences are presented, and compared with the results of different state-of-the-art background subtraction methods. The ROC curves and memory comparison of different background subtraction methods are also provided. The experimental results demonstrate the success of the proposed light-weight salient foreground detection method.
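The following toy background model captures two ingredients the abstract emphasizes, a small per-pixel memory footprint and activity-dependent updates, but it is a much-reduced stand-in for the actual algorithm:

```python
import numpy as np

class LightweightBackground:
    """Per-pixel running-average background with an activity counter, so
    long-unstable pixels (e.g. swaying vegetation) adapt faster."""
    def __init__(self, first_frame, alpha=0.02, thresh=25):
        self.bg = first_frame.astype(float)
        self.alpha, self.thresh = alpha, thresh
        self.unstable = np.zeros(first_frame.shape, dtype=np.uint8)

    def apply(self, frame):
        diff = np.abs(frame.astype(float) - self.bg)
        fg = diff > self.thresh
        # Count recent instability; decay it where the pixel is calm.
        self.unstable = np.where(fg, np.minimum(self.unstable + 1, 255),
                                 self.unstable // 2).astype(np.uint8)
        # Update the background only outside the foreground, with a rate
        # that grows with the pixel's instability history.
        rate = self.alpha * (1 + self.unstable / 64.0)
        self.bg = np.where(fg, self.bg, (1 - rate) * self.bg + rate * frame)
        return fg

rng = np.random.default_rng(7)
bgm = LightweightBackground(rng.integers(0, 256, (48, 64)))
mask = bgm.apply(rng.integers(0, 256, (48, 64)))
```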