Showing papers in "Image and Vision Computing in 2017"
TL;DR: This survey provides a comprehensive review of the notable steps taken towards recognizing human actions, starting with the pioneering methods that use handcrafted representations, and then, navigating into the realm of deep learning based approaches.
Abstract: Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis evolved from earlier schemes that were often limited to controlled environments to advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications from video surveillance to human–computer interaction, scientific milestones in action recognition are achieved more rapidly, quickly rendering obsolete what were until recently state-of-the-art methods. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader. We provide a detailed review of the work on human action recognition over the past decade. We refer to actions as meaningful human motions. In addition to hand-crafted representation methods, we review the impact of deep nets on action recognition. We follow a systematic taxonomy to highlight the essence of both hand-crafted and deep-net solutions. We present a comparison of methods at their algorithmic level and performance.
TL;DR: The thesis is that multimodal sentiment analysis holds a significant untapped potential with the arrival of complementary data streams for improving and going beyond text-based sentiment analysis.
Abstract: Sentiment analysis aims to automatically uncover the underlying attitude that we hold towards an entity. The aggregation of these sentiments over a population represents opinion polling and has numerous applications. Current text-based sentiment analysis relies on the construction of dictionaries and machine learning models that learn sentiment from large text corpora. Sentiment analysis from text is currently widely used for customer satisfaction assessment and brand perception analysis, among others. With the proliferation of social media, multimodal sentiment analysis is set to bring new opportunities with the arrival of complementary data streams for improving and going beyond text-based sentiment analysis. Since sentiment can be detected through the affective traces it leaves, such as facial and vocal displays, multimodal sentiment analysis offers promising avenues for analyzing facial and vocal expressions in addition to the transcript or textual content. These approaches leverage emotion recognition and context inference to determine the underlying polarity and scope of an individual's sentiment. In this survey, we define sentiment and the problem of multimodal sentiment analysis and review recent developments in multimodal sentiment analysis in different domains, including spoken reviews, images, video blogs, human–machine and human–human interactions. Challenges and opportunities of this emerging field are also discussed, leading to our thesis that multimodal sentiment analysis holds a significant untapped potential.
TL;DR: A multimodal approach for video-based emotion recognition in the wild that models videos with summarizing functionals of complementary visual descriptors and combines audio and visual features using least squares regression based classifiers and weighted score-level fusion.
Abstract: Multimodal recognition of affective states is a difficult problem, unless the recording conditions are carefully controlled. For recognition “in the wild”, large variances in face pose and illumination, cluttered backgrounds, occlusions, audio and video noise, as well as issues with subtle cues of expression are some of the issues to target. In this paper, we describe a multimodal approach for video-based emotion recognition in the wild. We propose using summarizing functionals of complementary visual descriptors for video modeling. These features include deep convolutional neural network (CNN) based features obtained via transfer learning, for which we illustrate the importance of flexible registration and fine-tuning. Our approach combines audio and visual features with least squares regression based classifiers and weighted score level fusion. We report state-of-the-art results on the EmotiW Challenge for “in the wild” facial expression recognition. Our approach scales to other problems, and ranked top in the ChaLearn-LAP First Impressions Challenge 2016 from video clips collected in the wild.
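Weighted score-level fusion of the kind described above can be sketched generically as follows; the scores, weights and min–max normalisation step are illustrative assumptions, not the authors' exact pipeline:

```python
import numpy as np

def fuse_scores(score_lists, weights):
    """Weighted score-level fusion: combine per-modality class scores
    (e.g. audio and visual classifier outputs) into one decision vector."""
    fused = np.zeros_like(score_lists[0], dtype=float)
    for scores, w in zip(score_lists, weights):
        s = np.asarray(scores, dtype=float)
        # Min-max normalise each modality so different score ranges are comparable.
        rng = s.max() - s.min()
        if rng > 0:
            s = (s - s.min()) / rng
        fused += w * s
    return fused

audio = [0.2, 0.7, 0.1]   # hypothetical per-class scores from an audio model
visual = [0.6, 0.3, 0.1]  # hypothetical per-class scores from a visual model
fused = fuse_scores([audio, visual], weights=[0.4, 0.6])
predicted_class = int(np.argmax(fused))
```

In practice the modality weights would be tuned on a validation set rather than fixed by hand.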
TL;DR: A review of the literature on vehicle detection under varying environments, covering appearance-based and motion-based methods for robust vehicle detection under various on-road conditions.
Abstract: Robust and efficient vehicle detection in monocular vision is an important task in Intelligent Transportation Systems. The development of computer vision techniques and the consequent accessibility of video image data have enabled new applications for on-road vehicle detection algorithms. This paper provides a review of the literature on vehicle detection under varying environments. Due to the variability of on-road driving environments, vehicle detection faces different problems and challenges. Therefore, many approaches have been proposed, which can be categorized as appearance-based methods and motion-based methods. In addition, special illumination, weather and driving scenarios are discussed in terms of methodology and quantitative evaluation. In the future, efforts should be focused on robust vehicle detection approaches for various on-road conditions.
TL;DR: This pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, as well as detect obstacles based on real-time depth map extraction.
Abstract: Cameras are a crucial exteroceptive sensor for self-driving cars as they are low-cost and small, provide appearance information about the environment, and work in various weather conditions. They can be used for multiple purposes such as visual navigation and obstacle detection. We can use a surround multi-camera system to cover the full 360-degree field-of-view around the car. In this way, we avoid blind spots which can otherwise lead to accidents. To minimize the number of cameras needed for surround perception, we utilize fisheye cameras. Consequently, standard vision pipelines for 3D mapping, visual localization, obstacle detection, etc. need to be adapted to take full advantage of the availability of multiple cameras rather than treat each camera individually. In addition, processing of fisheye images has to be supported. In this paper, we describe the camera calibration and subsequent processing pipeline for multi-fisheye-camera systems developed as part of the V-Charge project. This project seeks to enable automated valet parking for self-driving cars. Our pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, as well as detect obstacles based on real-time depth map extraction.
TL;DR: A new dataset of highly accurate per-frame annotations of valence and arousal for 600 challenging video clips extracted from feature films (also used in part for the AFEW dataset) is proposed and results show that geometric features perform well independently of the settings.
Abstract: Continuous dimensional models of human affect, such as those based on valence and arousal, have been shown to be more accurate in describing a broad range of spontaneous, everyday emotions than the more traditional models of discrete stereotypical emotion categories (e.g. happiness, surprise). However, most prior work on estimating valence and arousal considered only laboratory settings and acted data. It is unclear whether the findings of these studies also hold when the methodologies proposed in these works are tested on data collected in-the-wild. In this paper we investigate this. We propose a new dataset of highly accurate per-frame annotations of valence and arousal for 600 challenging video clips extracted from feature films (also used in part for the AFEW dataset). For each video clip, we further provide per-frame annotations of 68 facial landmarks. We subsequently evaluate a number of common baseline and state-of-the-art methods on both a commonly used laboratory recording dataset (Semaine database) and the newly proposed recording set (AFEW-VA). Our results show that geometric features perform well independently of the settings. However, as expected, methods that perform well on constrained data do not necessarily generalise to uncontrolled data and vice-versa. AFEW-VA dataset for continuous valence and arousal estimation in-the-wild. Review of existing work and databases for valence and arousal estimation from audio-video cues. Baseline results and comparison of the performance of various features. Comparison of state-of-the-art methods on controlled and unconstrained environments.
TL;DR: This work explores how Convolutional Neural Networks, a now de facto computational machine learning tool particularly in the area of Computer Vision, can be specifically applied to the task of visual sentiment prediction and presents visualizations of local patterns that the network learned to associate with image sentiment.
Abstract: Visual multimedia have become an inseparable part of our digital social lives, and they often capture moments tied with deep affections. Automated visual sentiment analysis tools can provide a means of extracting the rich feelings and latent dispositions embedded in these media. In this work, we explore how Convolutional Neural Networks (CNNs), a now de facto computational machine learning tool particularly in the area of Computer Vision, can be specifically applied to the task of visual sentiment prediction. We accomplish this through fine-tuning experiments using a state-of-the-art CNN and via rigorous architecture analysis, we present several modifications that lead to accuracy improvements over prior art on a dataset of images from a popular social media platform. We additionally present visualizations of local patterns that the network learned to associate with image sentiment for insight into how visual positivity (or negativity) is perceived by the model.
TL;DR: The proposed multi-label convolutional neural network (MLCNN) can simultaneously predict multiple pedestrian attributes and significantly outperforms the SVM based method on the PETA database.
Abstract: Recently, pedestrian attributes like gender, age, clothing etc., have been used as soft biometric traits for recognizing people. Unlike existing methods that assume the independence of attributes during their prediction, we propose a multi-label convolutional neural network (MLCNN) to predict multiple attributes together in a unified framework. Firstly, a pedestrian image is roughly divided into multiple overlapping body parts, which are simultaneously integrated in the multi-label convolutional neural network. Secondly, these parts are filtered independently and aggregated in the cost layer. The cost function is a combination of multiple binary attribute classification cost functions. Experiments show that the proposed method significantly outperforms the SVM based method on the PETA database. Multi-label convolutional neural network for pedestrian attribute classification. The proposed MLCNN can simultaneously predict multiple pedestrian attributes. Experiments on the PETA database have shown the superiority of the proposed MLCNN.
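A cost that combines multiple binary attribute classification terms, as described above, can be sketched minimally; the per-attribute binary cross-entropy form and the example logits are our assumptions, not the paper's exact layer:

```python
import numpy as np

def multi_label_cost(logits, labels):
    """Combined multi-label cost: one binary cross-entropy term per
    attribute (e.g. gender, backpack, hat), summed over attributes."""
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    labels = np.asarray(labels, dtype=float)
    eps = 1e-12  # guard against log(0)
    per_attr = -(labels * np.log(probs + eps)
                 + (1 - labels) * np.log(1 - probs + eps))
    return per_attr.sum()

# Hypothetical logits for three attributes and their ground-truth labels.
cost = multi_label_cost([2.0, -1.5, 0.3], [1, 0, 1])
```

Summing independent binary terms lets one network predict all attributes jointly while each attribute still gets its own supervision signal.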
TL;DR: A highly efficient pavement crack detection system is proposed, which has the following distinguishing features: a new description of the cracks based on the spatially clustered pixels, an improved adaptive thresholding method for image segmentation, and a novel region growing algorithm for crack detection.
Abstract: Among the various defects of asphalt pavement distress, much attention has been paid to cracks which often cause significant engineering and economic problems. Crack detection is not an easy task since images of road pavement surface are very difficult to analyze. In this paper, a highly efficient pavement crack detection system is proposed, which has the following distinguishing features. Firstly, a new description of the cracks is proposed based on the spatially clustered pixels with similar gray levels. Secondly, an adaptive thresholding method is presented for image segmentation by comprehensively taking into account the spatial distribution, intensities and geometric features of cracks. Thirdly, a new concept termed Region of Belief (ROB) is introduced to facilitate the subsequent detection by defining some credibility factors which indicate the reliability that a region could be labeled as a distress region which contains cracks, and an algorithm to extract such ROBs is devised accordingly. Lastly, a novel region growing algorithm is propounded for crack detection, which features starting with an ROB seed, determining the searching scope with a specially devised rule, and searching and merging a ROB with different regions following a similarity criterion which synthetically takes different cues into consideration. Two different types of experiments were conducted. The first one was carried out using 10,000 of our field-captured images which were taken from different road conditions and environments. The second one was completed using a benchmark dataset for a comparison with other recent publications. The evaluation performance is satisfactory for a variety of different cracks. For our own data, the detection accuracy is over 95% and more than 90% of coherent cracks without disconnected fragments have been correctly detected as the integrated ones. For the benchmark data, our detection performance also outperforms previously published results. 
Currently, our approach has been widely applied in China. A coarse-to-fine asphalt pavement crack detection approach is developed. A new description of the cracks is proposed based on the spatially clustered pixels. An improved adaptive thresholding method is presented for image segmentation. A new concept, Region of Belief (ROB), is introduced to facilitate the detection. A novel region growing algorithm is propounded for the crack detection.
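The region-growing idea underlying the crack detector can be sketched generically; the 4-connectivity, the gray-level tolerance rule and the toy image below are illustrative assumptions — the paper's ROB-seeded search and merging rules are considerably more elaborate:

```python
from collections import deque

def region_grow(img, seed, tol):
    """Grow a region from a seed pixel, merging 4-connected neighbours
    whose gray level is within `tol` of the seed value (BFS traversal)."""
    h, w = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in region:
                if abs(img[nr][nc] - seed_val) <= tol:
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region

# Toy example: a dark crack (low gray values) through bright pavement.
img = [[200, 200, 200, 200],
       [ 40,  45, 200, 200],
       [200,  50,  42, 200],
       [200, 200,  48, 200]]
crack = region_grow(img, seed=(1, 0), tol=20)
```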
TL;DR: The goals for the Liveness Detection (LivDet) competitions are to compare software-based fingerprint liveness detection and artifact detection algorithms, as well as fingerprint systems which incorporate liveness Detection or artifact detection capabilities, using a standardized testing protocol and large quantities of spoof and live tests.
Abstract: A spoof attack, a subset of presentation attacks, is the use of an artificial replica of a biometric in an attempt to circumvent a biometric sensor. Liveness detection, or presentation attack detection, distinguishes between live and fake biometric traits and is based on the principle that additional information can be garnered above and beyond the data procured by a standard authentication system to determine if a biometric measure is authentic. The goals for the Liveness Detection (LivDet) competitions are to compare software-based fingerprint liveness detection and artifact detection algorithms (Part 1), as well as fingerprint systems which incorporate liveness detection or artifact detection capabilities (Part 2), using a standardized testing protocol and large quantities of spoof and live tests. The competitions are open to all academic and industrial institutions which have a solution for either software-based or system-based fingerprint liveness detection. The LivDet competitions have been hosted in 2009, 2011, 2013 and 2015 and have shown themselves to provide a crucial look at the current state of the art in liveness detection schemes. There has been a noticeable increase in the number of participants in LivDet competitions as well as a noticeable decrease in error rates across competitions. Participants have grown from four to the most recent thirteen submissions for Fingerprint Part 1. Fingerprints Part 2 has held steady at two submissions each competition in 2011 and 2013 and only one for the 2015 edition. The continuous increase of competitors demonstrates a growing interest in the topic.
TL;DR: A novel approach based on total variation regularization and principal component pursuit (TV-PCP) is presented to deal with the detection of infrared dim targets and shows superior detection ability under various backgrounds.
Abstract: Robust detection of infrared dim and small targets contributes significantly to infrared systems in many applications. Due to the diversity of background scenes and the unique characteristics of targets, the detection of infrared targets remains a challenging problem. In this paper, a novel approach based on total variation regularization and principal component pursuit (TV-PCP) is presented to deal with this problem. The principal component pursuit model only considers the low-rank feature of background images, which results in poor detection ability in non-uniform and non-smooth scenes. We take into account the total variation regularization term to thoroughly describe the background feature, which can achieve good detection results as well as good background estimation results. Firstly, the input infrared image is transformed to a patch image model. Secondly, the TV-PCP model is presented on the patch image. An effective optimization algorithm is proposed to solve this model. Experiments on six real datasets show that the proposed method has superior detection ability under various backgrounds, especially with good background suppression performance and a low false alarm rate. An infrared dim target detection method based on total variation regularization and principal component pursuit is proposed. An optimization solver based on the alternating direction method is proposed to solve the TV-PCP model. By utilizing the total variation regularization, the TV-PCP performs well in target detection and background estimation.
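The decomposition described above admits a natural formulation; one plausible reading, with notation of our choosing (not necessarily the paper's: D the patch image, B the low-rank background, T the sparse target), is:

```latex
\min_{B,\,T} \; \|B\|_{*} \;+\; \lambda_{1}\,\mathrm{TV}(B) \;+\; \lambda_{2}\,\|T\|_{1}
\quad \text{s.t.} \quad D = B + T
```

Here the nuclear norm \(\|B\|_{*}\) encourages a low-rank background, the total variation term \(\mathrm{TV}(B)\) enforces its spatial smoothness in non-uniform scenes, and the \(\ell_{1}\) term keeps the target component sparse; per the highlights, such a model is solved with an alternating direction method.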
TL;DR: This is the first detailed discussion of a systemic view of a commercial automated parking system from the perspective of computer vision algorithms and demonstrates how camera systems are crucial for addressing a range of automated parking use cases and also to add robustness to systems based on active distance measuring sensors, such as ultrasonics and radar.
Abstract: Automated driving is an active area of research in both industry and academia. Automated parking, which is automated driving in a restricted scenario of parking with low speed manoeuvring, is a key enabling product for fully autonomous driving systems. It is also an important milestone from the perspective of a higher end system built from the previous generation driver assistance systems comprising collision warning, pedestrian detection, etc. In this paper, we discuss the design and implementation of an automated parking system from the perspective of computer vision algorithms. Designing a low-cost system with functional safety is challenging and leads to a large gap between the prototype and the end product, in order to handle all the corner cases. We demonstrate how camera systems are crucial for addressing a range of automated parking use cases and also to add robustness to systems based on active distance measuring sensors, such as ultrasonics and radar. The key vision modules which realize the parking use cases are 3D reconstruction, parking slot marking recognition, freespace and vehicle/pedestrian detection. We detail the important parking use cases and demonstrate how to combine the vision modules to form a robust parking system. To the best of the authors' knowledge, this is the first detailed discussion of a systemic view of a commercial automated parking system.
TL;DR: A 3D cascade regression approach is developed in which facial landmarks remain invariant across pose over a range of approximately 60 degrees; the experimental findings strongly support the validity of real-time 3D registration and reconstruction from 2D video.
Abstract: To enable real-time, person-independent 3D registration from 2D video, we developed a 3D cascade regression approach in which facial landmarks remain invariant across pose over a range of approximately 60 degrees. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. The algorithm utilizes a fast cascade regression framework trained on high-resolution 3D face-scans of posed and spontaneous emotion expression. The algorithm first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. Because no assumptions are required about illumination or surface properties, the method can be applied to a wide range of imaging conditions that include 2D video and uncalibrated multi-view video. The method has been validated in a battery of experiments that evaluate its precision of 3D reconstruction, extension to multi-view reconstruction, temporal integration for videos and 3D head-pose estimation. Experimental findings strongly support the validity of real-time, 3D registration and reconstruction from 2D video. The software is available online at http://zface.org. A 3D cascade regression approach is proposed in which facial landmarks remain invariant. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. Multi-view reconstruction and temporal integration for videos are presented. The method is robust for 3D head-pose estimation under various conditions.
TL;DR: This paper surveys this topic in terms of computational image enhancement, feature extraction, classification schemes and designed hardware-based acquisition set-ups to identify the path forward on ocular biometrics in visible spectrum.
Abstract: Ocular biometrics encompasses the imaging and use of characteristic features extracted from the eyes for personal recognition. Ocular biometric modalities in visible light have mainly focused on the iris, blood vessel structures over the white of the eye (mostly due to conjunctival and episcleral layers), and the periocular region around the eye. Most of the existing studies on iris recognition use the near infrared spectrum. However, conjunctival vasculature and periocular regions are imaged in the visible spectrum. Iris recognition in the visible spectrum is possible for light-colored irides or by utilizing special illumination. Ocular recognition in the visible spectrum is an important research area due to factors such as recognition at a distance, suitability for recognition with regular RGB cameras, and adaptability to mobile devices. Further, these ocular modalities can be obtained from a single RGB eye image and then fused together for enhanced performance of the system. Despite these advantages, the state of the art in ocular biometrics in the visible spectrum is not well known. This paper surveys this topic in terms of computational image enhancement, feature extraction, classification schemes and designed hardware-based acquisition set-ups. Future research directions are also enumerated to identify the path forward.
TL;DR: A novel cell tracking method using Convolutional Neural Networks and multi-task learning techniques, together with an optimized model update strategy, is proposed and performs well on the cell tracking problem.
Abstract: Cell tracking plays a crucial role in the biomedical and computer vision areas. As cells generally undergo frequent deformation and have small sizes in microscope images, tracking these non-rigid and non-salient cells is quite difficult in practice. Traditional visual tracking methods perform well on rigid and salient visual objects; however, they are not suitable for the cell tracking problem. In this paper, a novel cell tracking method is proposed using Convolutional Neural Networks (CNNs) as well as multi-task learning (MTL) techniques. The CNNs learn robust cell features and MTL improves the generalization performance of the tracking. The proposed cell tracking method consists of a particle filter motion model, a multi-task learning observation model, and an optimized model update strategy. In the training procedure, the cell tracking is divided into an online tracking task and an accompanying classification task using the MTL technique. The observation model is trained by building a CNN to learn robust cell features. The tracking procedure is started by assigning the cell position in the first frame of a microscope image sequence. Then, the particle filter model is applied to produce a set of candidate bounding boxes in the subsequent frames. The trained observation model provides the confidence probabilities corresponding to all of the candidates and selects the candidate with the highest probability as the final prediction. Finally, an optimized model update strategy is proposed to adapt the multi-task observation model to the variation of the tracked cell over the entire tracking procedure. The performance and robustness of the proposed method are analyzed by comparing with other commonly-used methods. Experimental results demonstrate that the proposed method performs well on the cell tracking problem.
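One step of the particle-filter-plus-observation-model loop described above can be sketched as follows; the Gaussian motion diffusion, the stand-in observation function and all numeric values are illustrative assumptions (in the paper the observation model is a trained CNN, not a distance heuristic):

```python
import random

def track_step(particles, observe, motion_std=2.0, rng=random):
    """One tracking step: diffuse candidate positions around the previous
    states, score each candidate with the observation model, and return
    the highest-confidence candidate as the prediction."""
    candidates = [(x + rng.gauss(0, motion_std),
                   y + rng.gauss(0, motion_std)) for x, y in particles]
    scored = [(observe(c), c) for c in candidates]
    best_score, best = max(scored)
    return best, best_score

random.seed(0)
true_pos = (10.0, 10.0)
# Stand-in observation model: confidence decays with distance to the cell.
observe = lambda c: 1.0 / (1.0 + (c[0] - true_pos[0])**2 + (c[1] - true_pos[1])**2)
particles = [(9.0, 9.0)] * 50  # previous state replicated 50 times
best, score = track_step(particles, observe)
```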
TL;DR: This paper presents a neighborhood repulsed correlation metric learning (NRCML) method for kinship verification via facial image analysis by using the correlation similarity measure where the kin relation of facial images can be better highlighted.
Abstract: Kinship verification is an interesting and challenging problem in human face analysis, which has received increasing interest in computer vision and biometrics in recent years. This paper presents a neighborhood repulsed correlation metric learning (NRCML) method for kinship verification via facial image analysis. Most existing metric learning based kinship verification methods are developed with the Euclidean similarity metric, which is not powerful enough to measure the similarity of face samples, especially when they are captured in wild conditions. Motivated by the fact that the correlation similarity metric can better handle face variations than the Euclidean similarity metric, we propose an NRCML method using the correlation similarity measure, with which the kin relation of facial images can be better highlighted. Since negative kinship samples are usually fewer than positive samples, we automatically identify the most discriminative negative samples in the training set to learn the distance metric, so that the discriminative information encoded by negative samples can be better exploited. Experimental results show the efficacy of the proposed approach. We present a method for kinship verification from facial images. A neighborhood repulsed correlation metric learning method is proposed. Experiments on two face datasets show the efficacy of the proposed approach.
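The correlation similarity measure favoured above over the Euclidean metric can be illustrated directly; this is the plain (mean-centred, normalised) correlation between two feature vectors, not the learned NRCML metric itself:

```python
import math

def correlation_similarity(x, y):
    """Pearson-style correlation similarity between two feature vectors:
    mean-centre both, then take their normalised inner product."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den if den else 0.0

# Perfectly correlated vectors score 1.0 despite different magnitudes.
sim = correlation_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Unlike Euclidean distance, this score is invariant to additive shifts and positive scalings of the features, which is one reason correlation-style measures cope better with appearance variations.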
TL;DR: The results of the evaluation suggest that discriminative approaches perform better than generative approaches when there are enough representative training samples, and that the generative methods are more robust to diversity of poses, but can fail to track when the motion is too quick for the effective search range of the particle filter.
Abstract: Human pose estimation is one of the most popular research topics in the past two decades, especially with the introduction of human pose datasets for benchmark evaluation. These datasets usually capture simple daily life actions. Here, we introduce a new dataset, the Martial Arts, Dancing and Sports (MADS), which consists of challenging martial arts actions (Tai-chi and Karate), dancing actions (hip-hop and jazz), and sports actions (basketball, volleyball, football, rugby, tennis and badminton). Two martial art masters, two dancers and an athlete performed these actions while being recorded with either multiple cameras or a stereo depth camera. In the multi-view or single-view setting, we provide three color views for 2D image-based human pose estimation algorithms. For depth-based human pose estimation, we provide stereo-based depth images from a single view. All videos have corresponding synchronized and calibrated ground-truth poses, which were captured using a Motion Capture system. We provide initial baseline results on our dataset using a variety of tracking frameworks, including a generative tracker based on the annealing particle filter and robust likelihood function, a discriminative tracker using twin Gaussian processes, and hybrid trackers, such as the Personalized Depth Tracker. The results of our evaluation suggest that discriminative approaches perform better than generative approaches when there are enough representative training samples, and that the generative methods are more robust to diversity of poses, but can fail to track when the motion is too quick for the effective search range of the particle filter. The data and the accompanying code will be made available to the research community.
We propose a new dataset called the Martial Arts, Dancing and Sports dataset for 3D human pose estimation. The dataset contains challenging actions from Tai-chi, Karate, jazz, hip-hop and sports. It contains 30 multi-view videos and 30 stereo depth videos, with a total of 53,000 frames. We provide initial results using several baseline algorithms.
TL;DR: This work trains a set of binary attribute classifiers that provide compact visual descriptions of faces; the extracted attributes capture meaningful properties of faces, and the method performs better than the previously proposed LBP-based authentication method.
Abstract: We present a method using facial attributes for continuous authentication of smartphone users. We train a set of binary attribute classifiers which provide compact visual descriptions of faces. The learned classifiers are applied to the image of the current user of a mobile device to extract the attributes, and then authentication is done by simply comparing the calculated attributes with the enrolled attributes of the original user. Extensive experiments on two publicly available unconstrained mobile face video datasets show that our method is able to capture meaningful attributes of faces and performs better than the previously proposed LBP-based authentication method. We also provide a practical variant of our method for efficient continuous authentication on an actual mobile device by performing extensive platform evaluations of memory usage, power consumption, and authentication speed. Facial attributes are effective for continuous authentication on mobile devices. Attribute-based features are more robust than low-level ones for authentication. Fusion of attribute-based and low-level features gives the best result. The proposed approach allows fast and energy-efficient enrollment and authentication.
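The attribute-comparison step described above can be sketched with a toy matching rule; the attribute vectors, the fraction-of-matches score and the acceptance threshold are all illustrative assumptions, not the paper's exact procedure:

```python
def authenticate(current_attrs, enrolled_attrs, threshold=0.8):
    """Compare binary attributes extracted from the current user's face
    with the enrolled user's attributes; accept when the fraction of
    matching attributes reaches the threshold."""
    matches = sum(a == b for a, b in zip(current_attrs, enrolled_attrs))
    score = matches / len(enrolled_attrs)
    return score >= threshold, score

enrolled = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical enrolled attribute profile
same_user = [1, 0, 1, 1, 0, 1, 1, 0]  # one attribute flips due to noise
ok, score = authenticate(same_user, enrolled)
```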
TL;DR: This paper builds a part-based statistical model of the 3D facial surface and combines it with non-rigid iterative closest point algorithms and shows that the proposed algorithm largely outperforms state-of-the-art algorithms for 3D face fitting and alignment especially when it comes to the description of the mouth region.
Abstract: The problem of fitting a 3D facial model to a 3D mesh has received a lot of attention over the past 15 to 20 years. The majority of the techniques fit a general model consisting of a simple parameterisable surface or a mean 3D facial shape. The drawback of this approach is that it is rather difficult to describe the non-rigid aspect of the face using just a single facial model. One way to capture the 3D facial deformations is by means of a statistical 3D model of the face or its parts. This is particularly evident when we want to capture the deformations of the mouth region. Even though statistical models of the face are generally applied for modelling facial intensity, there are few approaches that fit a statistical model of 3D faces. In this paper, in order to capture and describe the non-rigid nature of facial surfaces, we build a part-based statistical model of the 3D facial surface and combine it with non-rigid iterative closest point algorithms. We show that the proposed algorithm largely outperforms state-of-the-art algorithms for 3D face fitting and alignment, especially when it comes to the description of the mouth region. A statistical non-rigid ICP method for 3D face alignment is proposed. Local fitting in a dynamic subdivision framework helps capture subtle facial features. 2D point-driven mesh deformation in a pre-processing step helps improve performance.
TL;DR: The formulation is evaluated on segmenting and recognizing gestures from two different benchmark datasets, and the performance of the method compares favorably with state-of-the-art methods that employ Hidden Markov Models or Hidden Conditional Random Fields on the NATOPS dataset.
Abstract: A complete gesture recognition system should localize and classify each gesture from a given gesture vocabulary within a continuous video stream. In this work, we compare two approaches: a method that performs the tasks of temporal segmentation and classification simultaneously and another that performs the tasks sequentially. The first method trains a single random forest model to recognize gestures from a given vocabulary, as presented in a training dataset of video plus 3D body joint locations, as well as out-of-vocabulary (non-gesture) instances. The second method employs a cascaded approach, training a binary random forest model to distinguish gestures from background and a multi-class random forest model to classify segmented gestures. Given a test input video stream, both frameworks are applied using sliding windows at multiple temporal scales. We evaluated our formulation on segmenting and recognizing gestures from two different benchmark datasets: the NATOPS dataset of 9600 gesture instances from a vocabulary of 24 aircraft handling signals, and the ChaLearn dataset of 7754 gesture instances from a vocabulary of 20 Italian communication gestures. The performance of our method compares favorably with state-of-the-art methods that employ Hidden Markov Models or Hidden Conditional Random Fields on the NATOPS dataset. We conclude with a discussion of the advantages of using our model for the task of gesture recognition and segmentation, and outline weaknesses that need to be addressed in the future. Sequential and simultaneous random forest frameworks are compared. Fusing skeletal and appearance features enables accurate gesture representations. Uniform descriptors are created for gestures to account for variability in length.
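The multi-scale sliding-window scanning described above can be sketched as follows; the window sizes, stride ratio, and the `classify` callback (standing in for the trained random forest) are illustrative assumptions:

```python
def sliding_windows(n_frames, scales=(16, 32, 64), stride_ratio=0.5):
    """Yield (start, end) frame windows at multiple temporal scales,
    as used to scan a continuous stream for gesture candidates."""
    for size in scales:
        stride = max(1, int(size * stride_ratio))
        for start in range(0, max(1, n_frames - size + 1), stride):
            yield start, start + size

def detect(stream_features, classify, scales=(16, 32, 64)):
    """Apply a trained classifier to every window and keep the
    non-background hits. `classify` is a hypothetical stand-in for
    either the single multi-class forest or the binary+multi-class
    cascade of the paper."""
    hits = []
    for s, e in sliding_windows(len(stream_features), scales):
        label = classify(stream_features[s:e])
        if label != "background":
            hits.append((s, e, label))
    return hits
```

Overlapping detections from different scales would still need a merging step (e.g. non-maximum suppression), which is omitted here.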
TL;DR: This research considers the autoencoder as the feature learning architecture and proposes ℓ2,1-norm based regularization to improve its learning capacity, called the Group Sparse AutoEncoder (GSAE); the problem of minutia extraction is formulated as a two-class classification problem and the descriptor is learned using the novel formulation of GSAE.
Abstract: Unsupervised feature extraction is gaining a lot of research attention following its success in representing any kind of noisy data. Owing to the presence of a large number of training parameters, these feature learning models are prone to overfitting. Different regularization methods have been explored in the literature to avoid overfitting in deep learning models. In this research, we consider the autoencoder as the feature learning architecture and propose ℓ2,1-norm based regularization to improve its learning capacity, called the Group Sparse AutoEncoder (GSAE). The ℓ2,1-norm is based on the postulate that features from the same class will have a common sparsity pattern in the feature space. We present the learning algorithm for group sparse encoding using a majorization-minimization approach. The performance of the proposed algorithm is also studied on three baseline image datasets: MNIST, CIFAR-10, and SVHN. Further, using GSAE, we propose a novel deep learning based image representation for minutia detection from latent fingerprints. Latent fingerprints contain only a partial finger region and very noisy ridge patterns, and, depending on the surface on which they are deposited, significant background noise. We formulate the problem of minutia extraction as a two-class classification problem and learn the descriptor using the novel formulation of GSAE. Experimental results on two publicly available latent fingerprint datasets show that the proposed algorithm yields state-of-the-art results for automated minutia extraction.
Group Sparse AutoEncoder (GSAE) learns more discriminative features than an unsupervised autoencoder. Class-label based ℓ2,1-regularization is incorporated into the squared-error reconstruction loss function using a majorization-minimization approach. The proposed GSAE is used to learn minutia representations from noisy latent fingerprint images. Results on standard image datasets (MNIST, CIFAR-10, and SVHN) and latent fingerprint datasets (NIST SD-27 and MOLF) show the effectiveness of the proposed GSAE feature extraction approach.
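The ℓ2,1-norm at the heart of GSAE is the sum of the ℓ2 norms of a matrix's rows; penalising it drives whole rows (feature groups) to zero together, which is what produces the common same-class sparsity pattern. A minimal sketch of the norm itself (the majorization-minimization optimisation is omitted):

```python
import numpy as np

def l21_norm(W):
    """ℓ2,1-norm of a matrix: sum of the ℓ2 norms of its rows.
    Adding lambda * l21_norm(H) to a reconstruction loss encourages
    row-wise (group) sparsity in the representation H."""
    return float(np.sum(np.sqrt(np.sum(W * W, axis=1))))
```

Unlike the plain ℓ1-norm, which zeroes individual entries, this penalty zeroes entire rows at once.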
TL;DR: Experimental results indicate that employing Strength Modelling delivers a significant performance improvement for both arousal and valence in the unimodal and bimodal settings, and show that the proposed system is competitive with or outperforms other state-of-the-art approaches while remaining simple to implement.
Abstract: Automatic continuous affect recognition from audiovisual cues is arguably one of the most active research areas in machine learning. In addressing this regression problem, the advantages of individual models, such as the global-optimisation capability of Support Vector Machine for Regression and the context-sensitive capability of memory-enhanced neural networks, have been frequently explored, but in an isolated way. Motivated to leverage the individual advantages of these techniques, this paper proposes and explores a novel framework, Strength Modelling, where two models are concatenated in a hierarchical framework. In doing this, the strength information of the first model, as represented by its predictions, is joined with the original features, and this expanded feature space is then utilised as the input by the successive model. A major advantage of Strength Modelling, besides its ability to hierarchically explore the strength of different machine learning algorithms, is that it can work together with the conventional feature- and decision-level fusion strategies for multimodal affect recognition. To highlight the effectiveness and robustness of the proposed approach, extensive experiments have been carried out on two time- and value-continuous spontaneous emotion databases (RECOLA and SEMAINE) using audio and video signals. The experimental results indicate that employing Strength Modelling can deliver a significant performance improvement for both arousal and valence in the unimodal and bimodal settings. The results further show that the proposed system is competitive with or outperforms other state-of-the-art approaches while remaining simple to implement.
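The core stacking idea, appending the first model's predictions to the original features before training the second model, can be sketched with plain least-squares regressors standing in for the paper's SVR and memory-enhanced network components:

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with a bias term (a stand-in for either
    component model in Strength Modelling)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def strength_model(X_train, y_train, X_test):
    """The first model's predictions are appended to the original
    features, and the expanded feature space is fed to the
    successive model."""
    w1 = fit_linear(X_train, y_train)
    p1_train = predict_linear(w1, X_train)
    p1_test = predict_linear(w1, X_test)
    X2_train = np.hstack([X_train, p1_train[:, None]])
    X2_test = np.hstack([X_test, p1_test[:, None]])
    w2 = fit_linear(X2_train, y_train)
    return predict_linear(w2, X2_test)
```

With two identical linear models the stacking is redundant; the framework pays off when the two models have complementary strengths, as in the SVR-plus-recurrent-network pairings the paper studies.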
TL;DR: The ideas behind the model are summarized and it is generalized to take into account multiple dense input streams: the image itself, stereo depth maps, and semantic class probability maps that can be generated, e.g., by deep convolutional neural networks.
Abstract: Recent progress in advanced driver assistance systems and the race towards autonomous vehicles is mainly driven by two factors: (1) increasingly sophisticated algorithms that interpret the environment around the vehicle and react accordingly, and (2) the continuous improvement of sensor technology itself. In terms of cameras, these improvements typically include higher spatial resolution, which as a consequence requires more data to be processed. The trend to add multiple cameras to cover the entire surroundings of the vehicle does not help in this regard. At the same time, an increasing number of special purpose algorithms need access to the sensor input data to correctly interpret the various complex situations that can occur, particularly in urban traffic. By observing these trends, it becomes clear that a key challenge for vision architectures in intelligent vehicles is to share computational resources. We believe this challenge should be faced by introducing a representation of the sensory data that provides compressed and structured access to all relevant visual content of the scene. The Stixel World discussed in this paper is such a representation. It is a medium-level model of the environment that is specifically designed to compress information about obstacles by leveraging the typical layout of outdoor traffic scenes. It has proven useful for a multitude of automotive vision applications, including object detection, tracking, segmentation, and mapping. In this paper, we summarize the ideas behind the model and generalize it to take into account multiple dense input streams: the image itself, stereo depth maps, and semantic class probability maps that can be generated, e.g., by deep convolutional neural networks. Our generalization is embedded into a novel mathematical formulation for the Stixel model. We further sketch how the free parameters of the model can be learned using structured SVMs.
TL;DR: In this article, a real-time Ego-Lane Analysis System (ELAS) is proposed that estimates ego-lane position, classifies lane marking types (LMTs) and road markings, performs lane departure warning (LDW), detects lane change events, and detects the presence of adjacent lanes (i.e., immediate left and right lanes).
Abstract: Decreasing costs of vision sensors and advances in embedded hardware boosted lane-related research (detection, estimation, tracking, etc.) in the past two decades. The interest in this topic has increased even more with the demand for advanced driver assistance systems (ADAS) and self-driving cars. Although extensively studied independently, there is still a need for studies that propose a combined solution for the multiple problems related to the ego-lane, such as lane departure warning (LDW), lane change detection, lane marking type (LMT) classification, road markings detection and classification, and detection of adjacent lanes (i.e., immediate left and right lanes) presence. In this paper, we propose a real-time Ego-Lane Analysis System (ELAS) capable of estimating ego-lane position, classifying LMTs and road markings, performing LDW and detecting lane change events. The proposed vision-based system works on a temporal sequence of images. Lane marking features are extracted in perspective and Inverse Perspective Mapping (IPM) images and combined to increase robustness. The final estimated lane is modeled as a spline using a combination of methods (Hough lines with a Kalman filter and a spline with a particle filter). Based on the estimated lane, all other events are detected. To validate ELAS and address the lack of lane datasets in the literature, a new dataset with more than 20 different scenes (in more than 15,000 frames) covering a variety of scenarios (urban road, highways, traffic, shadows, etc.) was created. The dataset was manually annotated and made publicly available to enable evaluation of several events that are of interest to the research community (i.e., lane estimation, change, and centering; road markings; intersections; LMTs; crosswalks and adjacent lanes). Moreover, the system was also validated quantitatively and qualitatively on other public datasets.
ELAS achieved high detection rates in all real-world events and proved to be ready for real-time applications. An accurate real-time, real-world Ego-Lane Analysis System (ELAS). A novel manually annotated lane dataset with more than 20 scenes (more than 15,000 frames). We publicly released the code and the novel dataset.
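One of the events built on the estimated lane, lane departure warning, can be sketched as a threshold on the lateral offset from the lane centre; the ratio threshold here is an illustrative assumption, not ELAS's exact rule:

```python
def lane_departure(vehicle_x, lane_left_x, lane_right_x, warn_ratio=0.9):
    """Flag a lane-departure warning when the vehicle's lateral position
    drifts beyond `warn_ratio` of the half lane width from the lane
    centre. Positions are lateral coordinates in any consistent unit
    (e.g. metres in the IPM ground plane)."""
    centre = (lane_left_x + lane_right_x) / 2.0
    half_width = (lane_right_x - lane_left_x) / 2.0
    offset = abs(vehicle_x - centre)
    return offset > warn_ratio * half_width
```

Lane change detection would follow from the same quantities: a departure on one side followed by the lane boundaries re-centering around the vehicle.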
TL;DR: In this paper, a hierarchical compositional model is proposed to recognize human activities using body poses estimated from RGB-D data, where geometric and motion descriptors are used to learn a dictionary of body poses and sparse compositions of these body poses are used for atomic human actions.
Abstract: This paper presents an approach to recognize human activities using body poses estimated from RGB-D data. We focus on recognizing complex activities composed of sequential or simultaneous atomic actions characterized by body motions of a single actor. We tackle this problem by introducing a hierarchical compositional model that operates at three levels of abstraction. At the lowest level, geometric and motion descriptors are used to learn a dictionary of body poses. At the intermediate level, sparse compositions of these body poses are used to obtain meaningful representations for atomic human actions. Finally, at the highest level, spatial and temporal compositions of these atomic actions are used to represent complex human activities. Our results show the benefits of using a hierarchical model that exploits the sharing and composition of body poses into atomic actions, and atomic actions into activities. A quantitative evaluation using two benchmark datasets illustrates the advantages of our model to perform action and activity recognition. A novel hierarchical model to recognize human activities using RGB-D data is proposed. The method jointly learns suitable representations at different abstraction levels. The model achieves multi-class discrimination providing useful mid-level annotations. The compositional capabilities of our model also bring robustness to body occlusions.
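The lowest level of the hierarchy, learning a dictionary of body poses from per-frame descriptors, can be sketched with a toy k-means; the descriptor contents and the clustering choice are illustrative assumptions standing in for the paper's dictionary learning:

```python
import numpy as np

def learn_pose_dictionary(pose_descriptors, k, iters=20, seed=0):
    """Toy k-means over per-frame geometric/motion descriptors.
    Returns k centroids (the pose "dictionary") and the per-frame
    assignments; frames are later encoded by the index of (or sparse
    weights over) their nearest pose."""
    rng = np.random.default_rng(seed)
    X = np.asarray(pose_descriptors, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```

The intermediate and top levels would then operate on these pose indices: sparse compositions of poses form atomic actions, and spatio-temporal compositions of actions form activities.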
TL;DR: This paper motivates the problem of jointly and efficiently training robust hash functions over data objects with multi-feature representations which may be noise-corrupted, and proposes an approach to effectively and efficiently learning a low-rank kernelized similarity consensus and hash functions.
Abstract: Learning hash functions/codes for similarity search over multi-view data is attracting increasing attention, where similar hash codes are assigned to data objects characterizing a consistent neighborhood relationship across views. Traditional methods in this category inherently suffer from three limitations: 1) they commonly adopt a two-stage scheme where a similarity matrix is first constructed, followed by a subsequent hash function learning; 2) they are commonly developed on the assumption that data samples with multiple representations are noise-free, which is not practical in real-life applications; and 3) they often incur a cumbersome training model caused by the neighborhood graph construction using all N points in the database (O(N)). In this paper, we motivate the problem of jointly and efficiently training robust hash functions over data objects with multi-feature representations which may be noise-corrupted. To achieve both robustness and training efficiency, we propose an approach to effectively and efficiently learning low-rank kernelized hash functions shared across views (we use kernelized similarity rather than a kernel, as the data-landmark affinity matrix is not a square symmetric matrix). Specifically, we utilize landmark graphs to construct tractable similarity matrices in multiple views to automatically discover the neighborhood structure in the data. To learn robust hash functions, a latent low-rank kernel function is used to construct hash functions in order to accommodate linearly inseparable data. In particular, a latent kernelized similarity matrix is recovered by rank minimization on multiple kernel-based similarity matrices. Extensive experiments on real-world multi-view datasets validate the efficacy of our method in the presence of error corruptions.
A robust hashing method for multi-view data with noise corruptions is presented. It jointly learns a low-rank kernelized similarity consensus and hash functions. An approximate landmark graph is employed to make training fast. Extensive experiments are conducted on benchmarks to show the efficacy of our model.
TL;DR: An unlinkability and irreversibility analysis of the so-called Bloom filter-based iris biometric template protection, and a Secure Multiparty Computation (SMC) protocol that benefits from the alignment-free feature of this Bloom filter construction in order to compute the matching scores efficiently and securely.
Abstract: In this work, we develop an unlinkability and irreversibility analysis of the so-called Bloom filter-based iris biometric template protection introduced at ICB 2013. We go further than the unlinkability analysis of Hermans et al. presented at BIOSIG 2014. Firstly, we analyse unlinkability on protected templates built from two different iriscodes coming from the same iris, whereas Hermans et al. analysed only protected templates from the same iriscode. Moreover, we introduce an irreversibility analysis that exploits the non-uniformity of the biometric data. Our experiments demonstrate new vulnerabilities of this scheme. We then discuss the security of other similar protected biometric templates based on Bloom filters that have been suggested in the literature since 2013. Finally, we suggest a Secure Multiparty Computation (SMC) protocol that benefits from the alignment-free feature of this Bloom filter construction, in order to compute the matching scores efficiently and securely.
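The Bloom filter construction being analysed maps each column of a binary iris-code block to a set bit in a per-block filter, which is what makes matching alignment-free (rotations permute columns within a block but leave the set of column values largely unchanged). A minimal sketch under assumed block parameters (the original scheme fixes the block sizes):

```python
def bloom_template(iris_code, block_width=4):
    """Build a Bloom-filter representation of a binary iris-code matrix:
    split the code into blocks of `block_width` columns, read each
    column of a block as an integer, and record that value in the
    block's filter."""
    rows = len(iris_code)
    cols = len(iris_code[0])
    filters = []
    for start in range(0, cols, block_width):
        bloom = set()
        for c in range(start, min(start + block_width, cols)):
            value = 0
            for r in range(rows):
                value = (value << 1) | iris_code[r][c]
            bloom.add(value)
        filters.append(bloom)
    return filters

def dissimilarity(f1, f2):
    """Alignment-free score: normalised symmetric difference of the
    per-block filters (0 = identical, 1 = disjoint)."""
    num = sum(len(a ^ b) for a, b in zip(f1, f2))
    den = sum(len(a) + len(b) for a, b in zip(f1, f2))
    return num / den if den else 0.0
```

The irreversibility concern analysed in the paper stems from the non-uniformity of these column values, which lets an attacker guess likely pre-images of each filter.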
TL;DR: Experimental results show promising results for two challenging datasets which have poor image quality, i.e., a remote face dataset and the Point and Shoot Face Recognition Challenge dataset.
Abstract: In this paper, we propose a robust local descriptor for face recognition. It consists of two components, one based on a shearlet decomposition and the other on the local binary pattern (LBP). Shearlets can completely analyze the singular structures of piecewise smooth images, which is useful since singularities and irregular structures carry useful information in an underlying image. Furthermore, LBP is effective for describing the edges extracted by shearlets even when the images contain a high level of noise. Experimental results using the Face Recognition Grand Challenge dataset show that the proposed local descriptor significantly outperforms many widely used features (e.g., Gabor and deep learning-based features) when the images are corrupted by random noise, demonstrating robustness to noise. In addition, experimental results show promising results for two challenging datasets which have poor image quality, i.e., a remote face dataset and the Point and Shoot Face Recognition Challenge dataset.
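A basic 8-neighbour LBP, the second component of the descriptor, can be sketched as follows; in the paper it is applied to shearlet-filtered edge images, while here plain intensities are used for illustration:

```python
import numpy as np

def lbp_image(img):
    """Basic 8-neighbour local binary pattern: each interior pixel gets
    a byte whose bits mark which neighbours are >= the centre pixel."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # neighbour offsets, clockwise from top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[1:h-1, 1:w-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1+dy:h-1+dy, 1+dx:w-1+dx]
        out |= (neigh >= centre).astype(np.uint8) << (7 - bit)
    return out
```

The face descriptor would then histogram these codes over local regions and concatenate the histograms.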
TL;DR: The experimental results show the superiority of the proposed technique over existing techniques in terms of classification rates; the script-independent approach is applied to English, French and Arabic handwriting.
Abstract: Detection of gender from the handwriting of an individual presents an interesting research problem with applications in forensic document examination, writer identification and psychological studies. This paper presents an effective technique to predict the gender of an individual from off-line images of handwriting. The proposed technique relies on a global approach that considers writing images as textures. Each handwritten image is converted into a textured image which is decomposed into a series of wavelet sub-bands at a number of levels. The wavelet sub-bands are then extended into data sequences. Each data sequence is quantized to produce a probabilistic finite state automaton (PFSA) that generates feature vectors. These features are used to train two classifiers, an artificial neural network and a support vector machine, to discriminate between male and female writings. The performance of the proposed system was evaluated on two databases, QUWI and MSHD, within a number of challenging experimental scenarios, and realized classification rates of up to 80%. The experimental results show the superiority of the proposed technique over existing techniques in terms of classification rates. Prediction of gender from offline images of handwriting using textural information. Wavelet transform with symbolic dynamic filtering for feature extraction. Classification using support vector machines and artificial neural networks. Script-independent approach applied to English, French and Arabic handwriting. Improved results on the QUWI and MSHD databases compared to existing methods.
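The first step of the pipeline, decomposing the handwriting texture into wavelet sub-bands, can be sketched with one level of a 2-D Haar transform (an averaging variant; the paper's exact wavelet and the symbolic quantisation step are omitted):

```python
import numpy as np

def haar2d(img):
    """One level of a 2-D Haar wavelet decomposition, returning the four
    sub-bands (LL, LH, HL, HH). LL is a half-resolution average;
    the other bands carry horizontal, vertical, and diagonal detail.
    Expects even height and width."""
    img = np.asarray(img, dtype=float)
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 4.0
    lh = (a - b + c - d) / 4.0
    hl = (a + b - c - d) / 4.0
    hh = (a - b - c + d) / 4.0
    return ll, lh, hl, hh
```

Deeper levels of the decomposition are obtained by reapplying the transform to the LL band; the detail bands are what the paper flattens into data sequences for quantisation.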
TL;DR: Results demonstrate that AIRSV is more effective than classical AIRS and gives similar, and sometimes better, performance than SVM as well as state-of-the-art methods.
Abstract: This work proposes a novel system for off-line handwritten signature verification. A new descriptor founded on a quad-tree structure of the Histogram Of Templates (HOT) is introduced. For the verification step, we propose a robust implementation of the Artificial Immune Recognition System (AIRS). This classifier is inspired by the natural immune system, which generates antibodies to protect the human body against antigens. The AIRS training develops new memory cells that are subsequently used to recognize data through a k-Nearest Neighbor (kNN) classification. Here, to obtain a robust verification, the kNN classification is substituted by a Support Vector (SV) decision, yielding the AIRSV classifier. Experiments are performed on three datasets, namely MCYT-75, GPDS-300 and GPDS-4000. AIRSV performance is assessed in comparison with both conventional AIRS and SVM. The obtained results demonstrate that AIRSV is more effective than classical AIRS. Moreover, the proposed signature verification system gives similar, and sometimes better, performance than SVM as well as state-of-the-art methods.