
Showing papers presented at "Workshop on Applications of Computer Vision in 2017"


Proceedings ArticleDOI
24 Mar 2017
TL;DR: A new method for setting the learning rate, named cyclical learning rates, is described, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates.
Abstract: It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedule for the global learning rates. Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values. Training with cyclical learning rates instead of fixed values achieves improved classification accuracy without a need to tune and often in fewer iterations. This paper also describes a simple way to estimate "reasonable bounds" – linearly increasing the learning rate of the network for a few epochs. In addition, cyclical learning rates are demonstrated on the CIFAR-10 and CIFAR-100 datasets with ResNets, Stochastic Depth networks, and DenseNets, and the ImageNet dataset with the AlexNet and GoogLeNet architectures. These are practical tools for everyone who trains neural networks.
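For illustration, here is a minimal Python sketch of the triangular cyclical schedule described above (parameter values such as base_lr, max_lr and step_size are assumed, not taken from the paper's experiments):

    import math

    def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
        """Triangular cyclical learning rate: ramp linearly from base_lr to
        max_lr over step_size iterations, then back down, and repeat."""
        cycle = math.floor(1 + iteration / (2 * step_size))
        x = abs(iteration / step_size - 2 * cycle + 1)
        return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

    # Sample the schedule: the rate rises and falls every 2*step_size iterations.
    for it in range(0, 8001, 1000):
        print(it, round(triangular_clr(it), 5))

The "reasonable bounds" mentioned above would come from the linear range test: increase the learning rate linearly for a few epochs and pick base_lr and max_lr from where accuracy starts to improve and where it begins to degrade.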

1,521 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: T-LESS as discussed by the authors is a new public dataset for estimating the 6D pose of thirty texture-less, industry-relevant rigid objects that lack discriminative color or reflectance properties; a unique property is that some of the objects are parts of others.
Abstract: We introduce T-LESS, a new public dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. The dataset features thirty industry-relevant objects with no significant texture and no discriminative color or reflectance properties. The objects exhibit symmetries and mutual similarities in shape and/or size. Compared to other datasets, a unique property is that some of the objects are parts of others. The dataset includes training and test images that were captured with three synchronized sensors, specifically a structured-light and a time-of-flight RGB-D sensor and a high-resolution RGB camera. There are approximately 39K training and 10K test images from each sensor. Additionally, two types of 3D models are provided for each object, i.e. a manually created CAD model and a semi-automatically reconstructed one. Training images depict individual objects against a black background. Test images originate from twenty test scenes having varying complexity, which increases from simple scenes with several isolated objects to very challenging ones with multiple instances of several objects and with a high amount of clutter and occlusion. The images were captured from a systematically sampled view sphere around the object/scene, and are annotated with accurate ground truth 6D poses of all modeled objects. Initial evaluation results indicate that the state of the art in 6D object pose estimation has ample room for improvement, especially in difficult cases with significant occlusion. The T-LESS dataset is available online at cmp.felk.cvut.cz/t-less.

289 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: This work provides a simple universal spatial modeling method, complementary to RNN model enhancements: a set of simple geometric features, motivated by the evolution of previous work, that outperforms other features and achieves state-of-the-art results on four datasets.
Abstract: RNN-based approaches have achieved outstanding performance on action recognition with skeleton inputs. Currently these methods limit their inputs to coordinates of joints and improve the accuracy mainly by extending RNN models to spatial domains in various ways. While such models explore relations between different parts directly from joint coordinates, we provide a simple universal spatial modeling method perpendicular to the RNN model enhancement. Specifically, we select a set of simple geometric features, motivated by the evolution of previous work. With experiments on a 3-layer LSTM framework, we observe that the geometric relational features based on distances between joints and selected lines outperform other features and achieve state-of-the-art results on four datasets. Further, we show the sparsity of input gate weights in the first LSTM layer trained on geometric features and demonstrate that utilizing joint-line distances as input requires less data for training.
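As a small illustration of the joint-line distance feature mentioned above (a sketch with assumed notation, not the authors' code), the distance from a joint to the line spanned by two other joints can be computed as:

    import numpy as np

    def joint_line_distance(joint, line_start, line_end):
        """Distance from a 3D joint to the line through two other joints."""
        p, a, b = (np.asarray(v, dtype=float) for v in (joint, line_start, line_end))
        d = b - a
        # distance = |(p - a) x d| / |d|
        return np.linalg.norm(np.cross(p - a, d)) / np.linalg.norm(d)

    # Example: right hand against the line through the two shoulders.
    print(joint_line_distance([0.4, 0.1, 0.2], [-0.2, 0.5, 0.0], [0.2, 0.5, 0.0]))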

284 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: In this paper, a subcategory-aware CNN was proposed for object detection and pose estimation, which achieved state-of-the-art performance on both detection and pose estimation on commonly used benchmarks.
Abstract: In Convolutional Neural Network (CNN)-based object detection methods, region proposal becomes a bottleneck when objects exhibit significant scale variation, occlusion or truncation. In addition, these methods mainly focus on 2D object detection and cannot estimate detailed properties of objects. In this paper, we propose subcategory-aware CNNs for object detection. We introduce a novel region proposal network that uses subcategory information to guide the proposal generating process, and a new detection network for joint detection and subcategory classification. By using subcategories related to object pose, we achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks.

276 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: This work introduces a soft-rejection based network fusion method to fuse the soft metrics from all networks together to generate the final confidence scores, and proposes a method for integrating a pixel-wise semantic segmentation network into the network fusion architecture as a reinforcement to the pedestrian detector.
Abstract: We propose a deep neural network fusion architecture for fast and robust pedestrian detection. The proposed network fusion architecture allows for parallel processing of multiple networks for speed. A single shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates of different sizes and occlusions. This network outputs a large variety of pedestrian candidates to cover the majority of ground-truth pedestrians while also introducing a large number of false positives. Next, multiple deep neural networks are used in parallel for further refinement of these pedestrian candidates. We introduce a soft-rejection based network fusion method to fuse the soft metrics from all networks together to generate the final confidence scores. Our method performs better than existing state-of-the-art methods, especially when detecting small-size and occluded pedestrians. Furthermore, we propose a method for integrating a pixel-wise semantic segmentation network into the network fusion architecture as a reinforcement to the pedestrian detector. The approach outperforms state-of-the-art methods on most protocols on the Caltech Pedestrian dataset, with significant boosts on several protocols. It is also faster than all other methods.
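A rough sketch of how a soft-rejection style fusion rule can be written (the acceptance threshold and the scaling floor below are illustrative assumptions, not the paper's exact constants): each refinement network scales a candidate's confidence rather than hard-rejecting it, so no single network can eliminate a candidate on its own.

    def soft_rejection_fuse(base_score, classifier_probs, accept=0.7, floor=0.1):
        """Scale a candidate's detection score by each classifier's confidence.
        Probabilities above `accept` boost the score, low ones dampen it, and
        `floor` keeps any single network from fully rejecting the candidate.
        (Illustrative constants; see the paper for the exact fusion rule.)"""
        score = base_score
        for p in classifier_probs:
            score *= max(p / accept, floor)
        return score

    # A candidate with base score 0.9, judged by three refinement networks.
    print(soft_rejection_fuse(0.9, [0.95, 0.60, 0.80]))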

251 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: The authors employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training.
Abstract: We present a novel method for constructing a Variational Autoencoder (VAE). Instead of using pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of a VAE, which ensures that the VAE's output preserves the spatial correlation characteristics of the input, leading the output to have a more natural visual appearance and better perceptual quality. Based on recent deep learning works such as style transfer, we employ a pre-trained deep convolutional neural network (CNN) and use its hidden features to define a feature perceptual loss for VAE training. Evaluated on the CelebA face dataset, we show that our model produces better results than other methods in the literature. We also show that our method can produce latent vectors that can capture the semantic information of face expressions and can be used to achieve state-of-the-art performance in facial attribute prediction.
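A compact PyTorch-style sketch of the feature perceptual loss idea (the choice of VGG-19 layers and the loss weighting are assumptions for illustration, not necessarily the paper's configuration):

    import torch
    import torch.nn.functional as F
    from torchvision import models

    vgg = models.vgg19(weights="IMAGENET1K_V1").features.eval()   # frozen, pre-trained
    for p in vgg.parameters():
        p.requires_grad = False
    FEATURE_LAYERS = {3, 8, 17, 26}   # relu1_2, relu2_2, relu3_4, relu4_4 (assumed choice)

    def feature_perceptual_loss(x, x_recon):
        """Sum of MSEs between hidden VGG feature maps of input and reconstruction."""
        loss, fx, fr = 0.0, x, x_recon
        for i, layer in enumerate(vgg):
            fx, fr = layer(fx), layer(fr)
            if i in FEATURE_LAYERS:
                loss = loss + F.mse_loss(fx, fr)
                if i == max(FEATURE_LAYERS):
                    break
        return loss

    def vae_loss(x, x_recon, mu, logvar, beta=1.0):
        """Feature perceptual reconstruction term plus the usual KL divergence."""
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return feature_perceptual_loss(x, x_recon) + beta * kl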

208 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: A novel Convolutional Coding-based Rain Removal (CCRR) algorithm for automatically removing rain streaks from a single rainy image is proposed, built on a new method for learning convolutional sparsity-based and low-rank filters that efficiently represent the clear background image and the rain streaks, respectively.
Abstract: We propose a novel Convolutional Coding-based Rain Removal (CCRR) algorithm for automatically removing rain streaks from a single rainy image. Our method first learns a set of generic sparsity-based and low-rank representation-based convolutional filters for efficiently representing the clear background image and rain streaks, respectively. To this end, we first develop a new method for learning a set of convolutional low-rank filters. Then, using these learned filters, we propose an optimization problem to decompose a rainy image into a clear background image and a rain streak image. By working directly on the whole image, the proposed rain streak removal algorithm does not need to divide the image into overlapping patches for learning local dictionaries. Extensive experiments on synthetic and real images show that the proposed method performs favorably compared to state-of-the-art rain streak removal algorithms.
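One plausible form of the resulting decomposition objective, written here in LaTeX with assumed notation (a sketch of the idea, not the paper's exact formulation): the rainy image O is explained by a background layer synthesized from the sparsity-based filters f_k and a rain layer synthesized from the low-rank filters g_l,

    \min_{\{z_k\},\{w_l\}}\;
    \Big\| O - \sum_k f_k * z_k - \sum_l g_l * w_l \Big\|_F^2
    \;+\; \lambda_1 \sum_k \|z_k\|_1 \;+\; \lambda_2 \sum_l \|w_l\|_1 ,

after which the background estimate is \sum_k f_k * z_k and the rain streak image is \sum_l g_l * w_l.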

146 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: A deep fusion framework that more effectively exploits spatial features from CNNs together with temporal features from LSTM models, allowing it to achieve high accuracy and outperform current state-of-the-art methods on three widely used databases: UCF11, UCFSports, and jHMDB.
Abstract: In this paper we address the problem of human action recognition from video sequences. Inspired by the exemplary results obtained via automatic feature learning and deep learning approaches in computer vision, we focus our attention towards learning salient spatial features via a convolutional neural network (CNN) and then map their temporal relationship with the aid of Long Short-Term Memory (LSTM) networks. Our contribution in this paper is a deep fusion framework that more effectively exploits spatial features from CNNs with temporal features from LSTM models. We also extensively evaluate their strengths and weaknesses. We find that by combining both sets of features, the fully connected features effectively act as an attention mechanism to direct the LSTM to interesting parts of the convolutional feature sequence. The significance of our fusion method is its simplicity and effectiveness compared to other state-of-the-art methods. The evaluation results demonstrate that this hierarchical multi-stream fusion method outperforms single-stream mapping methods, allowing it to achieve high accuracy and surpass current state-of-the-art methods on three widely used databases: UCF11, UCFSports, and jHMDB.
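A minimal PyTorch sketch of the CNN-to-LSTM pipeline described above (the ResNet-18 backbone, hidden size and other names are assumptions for illustration, not the paper's exact architecture):

    import torch
    import torch.nn as nn
    from torchvision import models

    class CnnLstmClassifier(nn.Module):
        """Per-frame CNN features fed to an LSTM for a sequence-level action label."""
        def __init__(self, num_classes, hidden=512):
            super().__init__()
            backbone = models.resnet18(weights=None)   # pre-trained weights in practice
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
            self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
            self.head = nn.Linear(hidden, num_classes)

        def forward(self, clips):                       # clips: (B, T, 3, H, W)
            b, t = clips.shape[:2]
            feats = self.cnn(clips.flatten(0, 1))       # (B*T, 512, 1, 1)
            feats = feats.flatten(1).view(b, t, -1)     # (B, T, 512)
            out, _ = self.lstm(feats)
            return self.head(out[:, -1])                # classify from the last step

    # Example: a batch of 2 clips, 16 frames each, 112x112 RGB.
    logits = CnnLstmClassifier(num_classes=11)(torch.randn(2, 16, 3, 112, 112))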

114 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: This paper systematically investigates the potential of Fast R-CNN and Faster R-CNN for aerial images, which achieve top performing results on common detection benchmark datasets, and proposes a new network that clearly outperforms state-of-the-art methods for vehicle detection in aerial images.
Abstract: Vehicle detection in aerial images is a crucial image processing step for many applications like screening of large areas. In recent years, several deep learning based frameworks have been proposed for object detection. However, these detectors were developed for datasets that considerably differ from aerial images. In this paper, we systematically investigate the potential of Fast R-CNN and Faster R-CNN for aerial images, which achieve top performing results on common detection benchmark datasets. To this end, we examine the applicability of both detectors together with 8 state-of-the-art object proposal methods used to generate a set of candidate regions. Relevant adaptations of the object proposal methods are provided. To overcome shortcomings of the original approach in handling small instances, we further propose our own network that clearly outperforms state-of-the-art methods for vehicle detection in aerial images. All experiments are performed on two publicly available datasets to account for differing characteristics such as ground sampling distance, number of objects per image and varying backgrounds.
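As a hedged illustration of the kind of adaptation discussed (this is not the authors' network, just a sketch of fine-tuning an off-the-shelf Faster R-CNN with smaller anchors, which is one common way to handle small aerial instances):

    import torchvision
    from torchvision.models.detection.rpn import AnchorGenerator

    # Smaller anchor sizes than the defaults, since vehicles in aerial images
    # cover only a few pixels (sizes here are illustrative assumptions).
    anchors = AnchorGenerator(sizes=((8,), (16,), (32,), (64,), (128,)),
                              aspect_ratios=((0.5, 1.0, 2.0),) * 5)
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
        weights=None, weights_backbone=None,   # pre-trained weights in practice
        rpn_anchor_generator=anchors,
        num_classes=2)                         # background + vehicle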

102 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: A novel Multi-Task Curriculum Transfer (MTCT) deep learning method to explore multiple sources of different types of web annotations with multi-labelled fine-grained attributes for model transfer learning from well-controlled shop clothing images collected from web retailers to in-the-wild images from the street.
Abstract: Recognising detailed clothing characteristics (fine-grained attributes) in unconstrained images of people in-the-wild is a challenging task for computer vision, especially when there is only limited training data from the wild whilst most data available for model learning are captured in well-controlled environments using fashion models (well lit, no background clutter, frontal view, high-resolution). In this work, we develop a deep learning framework capable of model transfer learning from well-controlled shop clothing images collected from web retailers to in-the-wild images from the street. Specifically, we formulate a novel Multi-Task Curriculum Transfer (MTCT) deep learning method to explore multiple sources of different types of web annotations with multi-labelled fine-grained attributes. Our multi-task loss function is designed to extract more discriminative representations in training by jointly learning all attributes, and our curriculum strategy exploits the staged easy-to-hard transfer learning motivated by cognitive studies. We demonstrate the advantages of the MTCT model over the state-of-the-art methods on the X-Domain benchmark, a large scale clothing attribute dataset. Moreover, we show that the MTCT model has a notable advantage over contemporary models when the training data size is small.
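A small sketch of the joint multi-attribute loss (a simplification of the MTCT design; the attribute heads and sizes below are assumptions):

    import torch
    import torch.nn as nn

    class MultiTaskAttributeHeads(nn.Module):
        """Shared feature vector -> one softmax head per clothing attribute."""
        def __init__(self, feat_dim, attribute_sizes):
            super().__init__()
            self.heads = nn.ModuleList(nn.Linear(feat_dim, n) for n in attribute_sizes)

        def forward(self, features):
            return [head(features) for head in self.heads]

    def multi_task_loss(logits_per_attr, labels_per_attr):
        """Joint training signal: sum of cross-entropies over all attributes."""
        ce = nn.CrossEntropyLoss()
        return sum(ce(logits, target) for logits, target in zip(logits_per_attr, labels_per_attr))

    # Example: three attributes (e.g. colour/10, sleeve/4, collar/6 classes), batch of 8.
    heads = MultiTaskAttributeHeads(256, [10, 4, 6])
    feats = torch.randn(8, 256)
    labels = [torch.randint(0, n, (8,)) for n in (10, 4, 6)]
    print(multi_task_loss(heads(feats), labels))

The curriculum part would then stage training from the clean shop images to the harder in-the-wild images on top of this loss.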

87 citations


Proceedings ArticleDOI
24 Mar 2017
TL;DR: In this article, a fully convolutional network combined with a recurrent unit that works on a sliding window over the temporal data is used for segmentation of video sequences, so the method can operate in an online fashion instead of over the whole input batch of video frames.
Abstract: Image segmentation is an important step in most visual tasks. While convolutional neural networks have been shown to perform well on single-image segmentation, to our knowledge, no study has been done on leveraging recurrent gated architectures for video segmentation. Accordingly, we propose and implement a novel method for online segmentation of video sequences that incorporates temporal data. The network is built from a fully convolutional network and a recurrent unit that works on a sliding window over the temporal data. We use a convolutional gated recurrent unit that preserves the spatial information and reduces the number of learned parameters. Our method has the advantage that it can work in an online fashion instead of operating over the whole input batch of video frames. The network is tested on the SegTrack V2 and DAVIS video segmentation benchmarks, where it achieves a 5% improvement on SegTrack and a 3% improvement on DAVIS in F-measure over a plain fully convolutional network.
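A compact sketch of a convolutional GRU cell of the kind described (kernel size and channel counts are assumptions; the gating follows the standard ConvGRU equations):

    import torch
    import torch.nn as nn

    class ConvGRUCell(nn.Module):
        """GRU cell whose gates are 2D convolutions, so the hidden state stays spatial."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)  # update, reset
            self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)       # candidate
            self.hid_ch = hid_ch

        def forward(self, x, h=None):
            if h is None:
                h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    # Run the cell over a sliding window of per-frame feature maps.
    cell, h = ConvGRUCell(in_ch=64, hid_ch=64), None
    for frame_feat in torch.randn(5, 1, 64, 32, 32):    # 5 time steps, batch of 1
        h = cell(frame_feat, h)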

Proceedings ArticleDOI
24 Mar 2017
TL;DR: Evaluating how well weights transfer between Convolutional Neural Networks trained on data from two radically different plankton imaging systems indicates that these data sets are perhaps more similar in the eyes of a machine classifier than previously assumed.
Abstract: Studying marine plankton is critical to assessing the health of the world's oceans. To sample these important populations, oceanographers are increasingly using specially engineered in situ digital imaging systems that produce very large data sets. Most automated annotation efforts have considered data from individual systems in isolation. This is predicated on the assumption that the images from each system are so different that there would be little benefit to considering out-of-domain data. Meanwhile, in the computer vision community, much effort has been dedicated to understanding how using out-of-domain images can improve the performance of machine classifiers. In this paper, we leverage these advances to evaluate how well weights transfer between Convolutional Neural Networks (CNNs) trained on data from two radically different plankton imaging systems. We also examine the utility of CNNs as feature extractors on a third unique plankton data set. Our results indicate that these data sets are perhaps more similar in the eyes of a machine classifier than previously assumed. Further, these tests underscore the value of using the rich feature representations learned by CNNs to classify data in vastly different domains.
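A minimal sketch of the two transfer strategies examined (backbone, class count and file names are placeholders, not the paper's setup): initialize from weights trained on another imaging system and fine-tune, or freeze the trunk and use it purely as a feature extractor.

    import torch
    import torch.nn as nn
    from torchvision import models

    # (1) Weight transfer: start from a CNN trained on the source imaging system,
    #     then fine-tune on the target system.
    model = models.resnet18(weights=None)
    # model.load_state_dict(torch.load("source_system_weights.pt"), strict=False)  # hypothetical file
    model.fc = nn.Linear(model.fc.in_features, 20)   # re-head for the target classes

    # (2) Feature extraction: freeze the convolutional trunk and train only the new head.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("fc.")
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)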

Proceedings ArticleDOI
24 Mar 2017
TL;DR: This work designs a novel algorithm for generating Synthetic Context Logo (SCL) training images to increase model robustness against unknown background clutters, resulting in superior logo detection performance.
Abstract: Logo detection in unconstrained images is challenging, particularly when only very sparse labelled training images are accessible due to high labelling costs. In this work, we describe a model training image synthesising method capable of significantly improving logo detection performance when only a handful of (e.g., 10) labelled training images captured in realistic context are available, avoiding extensive manual labelling costs. Specifically, we design a novel algorithm for generating Synthetic Context Logo (SCL) training images to increase model robustness against unknown background clutters, resulting in superior logo detection performance. For benchmarking model performance, we introduce a new logo detection dataset, TopLogo-10, collected from the top 10 most popular clothing/wearable brand-name logos captured in rich visual context. Extensive comparisons show the advantages of our proposed SCL model over the state-of-the-art alternatives for logo detection using two real-world logo benchmark datasets: FlickrLogo-32 and our new TopLogo-10.
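A rough sketch of the context-synthesis idea (this is not the authors' SCL algorithm; the scale range, placement policy and file paths are assumptions): paste a transparent logo at a random scale and position into a cluttered background scene and keep the pasted box as ground truth.

    import random
    from PIL import Image

    def synthesise_context_image(logo_path, scene_path, out_path):
        """Paste a logo at a random scale/position onto a background scene and
        return the pasted bounding box as the detection ground truth."""
        logo = Image.open(logo_path).convert("RGBA")
        scene = Image.open(scene_path).convert("RGB")
        scale = random.uniform(0.05, 0.3)                 # assumed scale range
        w = max(1, int(scene.width * scale))
        h = max(1, int(logo.height * w / logo.width))
        logo = logo.resize((w, h))
        x = random.randint(0, max(0, scene.width - w))
        y = random.randint(0, max(0, scene.height - h))
        scene.paste(logo, (x, y), mask=logo)              # alpha-composited paste
        scene.save(out_path)
        return (x, y, x + w, y + h)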

Proceedings ArticleDOI
24 Mar 2017
TL;DR: A new dataset containing around 47,500 cropped X-ray images of 32×32 pixels, with and without defects, in automotive components is released, and 24 computer vision techniques including deep learning, sparse representations, local descriptors and texture features are evaluated and compared.
Abstract: To ensure safety in the construction of important metallic components for roadworthiness, it is necessary to check every component thoroughly using non-destructive testing. In recent decades, X-ray testing has been adopted as the principal non-destructive testing method to identify defects within a component which are undetectable to the naked eye. Nowadays, modern computer vision techniques, such as deep learning and sparse representations, are opening new avenues in automatic object recognition in optical images. These techniques have been broadly used in object and texture recognition by the computer vision community with promising results in optical images. However, a comprehensive evaluation in X-ray testing is required. In this paper, we release a new dataset containing around 47,500 cropped X-ray images of 32×32 pixels with and without defects in automotive components. Using this dataset, we evaluate and compare 24 computer vision techniques including deep learning, sparse representations, local descriptors and texture features, among others. We show in our experiments that the best performance was achieved by a simple LBP descriptor with a linear SVM classifier, obtaining 97% precision and 94% recall. We believe that the methodology presented could be used in similar projects that have to deal with automated detection of defects.
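A short sketch of the best-performing baseline reported above, uniform LBP features with a linear SVM (the LBP parameters and SVM settings are common defaults, not necessarily the paper's):

    import numpy as np
    from skimage.feature import local_binary_pattern
    from sklearn.svm import LinearSVC

    def lbp_histogram(patch, P=8, R=1):
        """Uniform LBP histogram of a 32x32 grayscale patch."""
        codes = local_binary_pattern(patch, P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    # X: 32x32 grayscale patches, y: 1 = defect, 0 = no defect (random placeholders
    # here only so the sketch runs end to end).
    X = (np.random.rand(100, 32, 32) * 255).astype(np.uint8)
    y = np.random.randint(0, 2, 100)
    clf = LinearSVC(C=1.0).fit([lbp_histogram(p) for p in X], y)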

Proceedings ArticleDOI
24 Mar 2017
TL;DR: In this article, optical flow images are used as input to a convolutional neural network, which calculates a rotation and displacement for each image pixel to construct a map of where the camera has traveled.
Abstract: Visual odometry is a challenging task related to simultaneous localization and mapping that aims to generate a map of the path traveled from a visual data stream. Based on one or two cameras, motion is estimated from features and pixel differences between frames. Because of the frame rate of the cameras, there are generally small, incremental changes between subsequent frames, where optical flow can be assumed to be proportional to the physical distance moved by an egocentric reference, such as a camera on a vehicle. In this paper, a visual odometry system called Flowdometry is proposed based on optical flow and deep learning. Optical flow images are used as input to a convolutional neural network, which calculates a rotation and displacement for each image pixel. The displacements and rotations are applied incrementally to construct a map of where the camera has traveled. The proposed system is trained and tested on the KITTI visual odometry dataset, and accuracy is measured by the difference in distances between ground truth and predicted driving trajectories. Different convolutional neural network architecture configurations are tested for accuracy, and then results are compared to other state-of-the-art monocular odometry systems using the same dataset. The average translation error from the Flowdometry system is 10.77% and the average rotation error is 0.0623 degrees per meter. The total execution time of the system per optical flow frame is 0.633 seconds, which offers a 23.796x speedup over state-of-the-art methods using deep learning.
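A simplified sketch of the dead-reckoning step implied above (the network's per-pixel outputs are assumed to have already been reduced to one rotation and one displacement per frame; this is an illustration, not the Flowdometry code):

    import math

    def integrate_trajectory(rotations, displacements):
        """Accumulate per-frame heading changes (radians) and forward displacements
        (metres) into a 2D trajectory, starting at the origin facing +x."""
        x, y, heading = 0.0, 0.0, 0.0
        path = [(x, y)]
        for dtheta, dist in zip(rotations, displacements):
            heading += dtheta
            x += dist * math.cos(heading)
            y += dist * math.sin(heading)
            path.append((x, y))
        return path

    # Example: drive roughly straight, then curve gently to the left.
    print(integrate_trajectory([0.0, 0.0, 0.05, 0.05], [1.0, 1.0, 1.0, 1.0]))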

Proceedings ArticleDOI
24 Mar 2017
TL;DR: In this paper, the authors estimate the geometry of a room and the 3D pose of objects from a single 360° panorama image, assuming Manhattan World geometry, and formulate the task as an inference problem in which they estimate positions and orientations of walls and objects.
Abstract: This paper presents a method of estimating the geometry of a room and the 3D pose of objects from a single 360° panorama image. Assuming Manhattan World geometry, we formulate the task as an inference problem in which we estimate positions and orientations of walls and objects. The method combines surface normal estimation, 2D object detection and 3D object pose estimation. Quantitative results are presented on a dataset of synthetically generated 3D rooms containing objects, as well as on a subset of hand-labeled images from the public SUN360 dataset.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: A novel approach using 3-dimensional convolutional neural networks (C3Ds) to model the spatio-temporal information, cascaded with multimodal deep-belief networks (DBNs) that can represent the audio and video streams is introduced.
Abstract: Automatic emotion recognition has attracted great interest and numerous solutions have been proposed, most of which focus either individually on facial expression or acoustic information. While more recent research has considered multimodal approaches, individual modalities are often combined only by simple fusion at the feature and/or decision-level. In this paper, we introduce a novel approach using 3-dimensional convolutional neural networks (C3Ds) to model the spatio-temporal information, cascaded with multimodal deep-belief networks (DBNs) that can represent the audio and video streams. Experiments conducted on the eNTERFACE multimodal emotion database demonstrate that this approach leads to improved multimodal emotion recognition performance and significantly outperforms recent state-of-the-art proposals.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: The improved performance of the proposed Deep Convolutional Neural Network scheme for detecting a contact lens is demonstrated, with an average performance improvement of more than 10% in Correct Classification Rate (CCR%) when compared with eight different state-of-the-art contact lens detection systems.
Abstract: Contact lens detection in the eye is a significant task to improve the reliability of iris recognition systems. A contact lens overlays the iris region and prevents the iris sensor from capturing the normal iris region. In this paper, we present a novel scheme for detecting a contact lens using a Deep Convolutional Neural Network (CNN). The proposed CNN architecture ContlensNet is structured to have fifteen layers and configured for the three-class detection problem with the following classes: images with textured (or colored) contact lens, soft (or transparent) contact lens, and no contact lens. The proposed ContlensNet is trained using numerous iris image patches, and the problem of overfitting the network is addressed by using the dropout regularization method. Extensive experiments are carried out on two publicly available large-scale databases, namely the IIIT-Delhi Contact Lens Iris database (IIITD) and the Notre Dame cosmetic contact lens database 2013 (ND), which comprise contact lens iris samples captured using four different sensors. The obtained results demonstrate the improved performance of the proposed scheme, with an average performance improvement of more than 10% in Correct Classification Rate (CCR%) when compared with eight different state-of-the-art contact lens detection systems.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: A novel framework to detect developmental disorders from facial images, based on Deep Convolutional Neural Networks for feature extraction, is proposed; results indicate that the model performs better than average human intelligence in differentiating amongst different disabilities.
Abstract: Developmental Disorders are chronic disabilities that have a severe impact on the day to day functioning of a large section of the human population. Recognizing developmental disorders from facial images is an important but relatively unexplored challenge in the field of computer vision. This paper proposes a novel framework to detect developmental disorders from facial images. A spectrum of disorders consisting of Autism Spectrum Disorder, Cerebral Palsy, Fetal Alcohol Syndrome, Down syndrome, Intellectual disability and Progeria has been considered for recognition. The framework relies on Deep Convolutional Neural Networks (DCNN) for feature extraction. A new dataset comprising images of subjects with these disabilities was built for testing the performance of the framework. This model has been tested on different age groups and individual disabilities, and has also been compared to a similar model that uses human intelligence to identify different developmental disorders. The results indicate that the model performs better than average human intelligence in differentiating amongst different disabilities and is able to recognize subjects with these developmental disorders with an accuracy of 98.80%.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks.
Abstract: Long videos captured by consumers are typically tied to some of the most important moments of their lives, yet ironically are often the least frequently watched. The time required to initially retrieve and watch sections can be daunting. In this work we propose novel techniques for summarizing and annotating long videos. Existing video summarization techniques focus exclusively on identifying keyframes and subshots, however evaluating these summarized videos is a challenging task. Our work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preference. Key frames from the most impactful segments are converted to textual annotations using sequential encoding and decoding deep learning models. Our summarization technique is benchmarked on the VideoSet dataset, and evaluated by humans for informative and linguistic content. We believe this to be the first fully automatic method capable of simultaneous visual and textual summarization of long consumer videos.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This paper presents a very simple and efficient algorithm to estimate 1, 2 or 3 orthogonal vanishing point(s) on a calibrated image in Manhattan world by building a polar grid for the line intersection points, which functions as a lookup table for the validation of vanishing point hypotheses.
Abstract: This paper presents a very simple and efficient algorithm to estimate 1, 2 or 3 orthogonal vanishing point(s) on a calibrated image in Manhattan world. Unlike the traditional methods which apply 1, 3, 4, or 6 line(s) to generate vanishing point hypotheses, we propose to use 2 lines to get the first vanishing point v1, then uniformly sample the second vanishing point v2 on the great circle of v1 on the equivalent sphere, and finally calculate the third vanishing point v3 by the cross-product of v1 and v2. There are three advantages of the proposed method over traditional multi-line methods. First, the 2-line model is much more robust and reliable than the multi-line method, and can be applied in scenes with 1, 2 or 3 orthogonal vanishing point(s). Second, the probability of the 2-line model being formed of inlier line segments can be calculated given the outlier ratio, which means that the number of iterations can be determined, and thus the estimation of vanishing points can be performed in a very simple exhaustive way instead of the traditional RANSAC method. Third, real-time performance is achieved by building a polar grid for the line intersection points, which functions as a lookup table for the validation of vanishing point hypotheses. Our algorithm has been validated successfully on the YUD dataset and on sets of challenging real images.
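The homogeneous-coordinate arithmetic behind the 2-line model can be sketched as follows (normalized image coordinates are assumed, i.e. the calibration has already been applied; the sample values are hypothetical):

    import numpy as np

    def unit(v):
        return v / np.linalg.norm(v)

    def vp_from_two_lines(l1, l2):
        """First vanishing direction v1 from two line segments on a calibrated image:
        each segment gives the normal of its interpretation plane, and the cross
        product of the two normals is the common 3D direction."""
        n1 = np.cross(np.append(l1[0], 1.0), np.append(l1[1], 1.0))
        n2 = np.cross(np.append(l2[0], 1.0), np.append(l2[1], 1.0))
        return unit(np.cross(n1, n2))

    v1 = vp_from_two_lines([(0.10, 0.20), (0.50, 0.25)], [(0.00, 0.60), (0.60, 0.70)])

    # Sample v2 uniformly on the great circle orthogonal to v1; v3 follows by
    # orthogonality. (Assumes v1 is not aligned with the optical axis.)
    u = unit(np.cross(v1, [0.0, 0.0, 1.0]))
    w = np.cross(v1, u)
    for theta in np.linspace(0.0, np.pi, 8, endpoint=False):
        v2 = np.cos(theta) * u + np.sin(theta) * w
        v3 = np.cross(v1, v2)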

Proceedings ArticleDOI
01 Mar 2017
TL;DR: In this paper, the authors study ranking-based image croppers: while the best proposal window is traditionally estimated through visual quality assessment or saliency detection, the performance of an image cropper in essence depends on the ability to correctly rank a number of visually similar proposal windows.
Abstract: Automatic photo cropping is an important tool for improving visual quality of digital photos without resorting to tedious manual selection. Traditionally, photo cropping is accomplished by determining the best proposal window through visual quality assessment or saliency detection. In essence, the performance of an image cropper highly depends on the ability to correctly rank a number of visually similar proposal windows. Despite the ranking nature of automatic photo cropping, little attention has been paid to learning-to-rank algorithms in tackling such a problem. In this work, we conduct an extensive study on traditional approaches as well as ranking-based croppers trained on various image features. In addition, a new dataset consisting of high quality cropping and pairwise ranking annotations is presented to evaluate the performance of various baselines. The experimental results on the new dataset provide useful insights into the design of better photo cropping algorithms.
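To make the ranking framing concrete, here is a small sketch of a pairwise learning-to-rank objective over crop windows (the feature dimension, model and optimizer are assumptions, not the paper's exact baselines):

    import torch
    import torch.nn as nn

    ranker = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
    loss_fn = nn.MarginRankingLoss(margin=1.0)
    opt = torch.optim.Adam(ranker.parameters(), lr=1e-3)

    def pairwise_ranking_step(better_feats, worse_feats):
        """One update: the preferred crop window should out-score the other by a margin."""
        s_better = ranker(better_feats).squeeze(1)
        s_worse = ranker(worse_feats).squeeze(1)
        target = torch.ones_like(s_better)   # +1 means "first input should rank higher"
        loss = loss_fn(s_better, s_worse, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    # Features of annotated crop pairs would replace the random tensors below.
    pairwise_ranking_step(torch.randn(16, 128), torch.randn(16, 128))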

Proceedings ArticleDOI
24 Mar 2017
TL;DR: A new unsupervised learning method to train a deep feature extractor from unlabeled images is proposed, and the learned features are also shown to be useful in a more traditional classification application on the CIFAR-10 dataset.
Abstract: Spatio-temporal anomaly detection by unsupervised learning has applications in a wide range of practical settings. In this paper we present a surveillance system for industrial robots using a monocular camera. We propose a new unsupervised learning method to train a deep feature extractor from unlabeled images. Without any data augmentation, the algorithm co-learns the network parameters on different pseudo-classes simultaneously to create an unbiased feature representation. Combining the learned features with a prediction system, we can detect irregularities in a high-dimensional data feed (e.g., video of a robot performing a pick and place task). The results show how the proposed approach can detect previously unseen anomalies in the robot surveillance video. Although the technique is not designed for classification, we show the use of the learned features in a more traditional classification application on the CIFAR-10 dataset.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: In this article, a higher-order kernel (HOK) descriptor is generated from the late fusion of CNN classifier scores from all the frames in a sequence, which is then used as input to a video-level classifier.
Abstract: Most successful deep learning algorithms for action recognition extend models designed for image-based tasks such as object recognition to video. Such extensions are typically trained for actions on single video frames or very short clips, and then their predictions from sliding-windows over the video sequence are pooled for recognizing the action at the sequence level. Usually this pooling step uses the first-order statistics of frame-level action predictions. In this paper, we explore the advantages of using higher-order correlations, specifically, we introduce Higher-order Kernel (HOK) descriptors generated from the late fusion of CNN classifier scores from all the frames in a sequence. To generate these descriptors, we use the idea of kernel linearization. Specifically, a similarity kernel matrix, which captures the temporal evolution of deep classifier scores, is first linearized into kernel feature maps. The HOK descriptors are then generated from the higher-order co-occurrences of these feature maps, and are then used as input to a video-level classifier. We provide experiments on two fine-grained action recognition datasets, and show that our scheme leads to state-of-the-art results.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: In this article, a novel ordered representation of consecutive optical flow frames is proposed to capture the action dynamics more efficiently than RGB frames, which leads to significant improvements over RGB frames while achieving accuracy comparable to the state-of-the-art on UCF101 and HMDB datasets.
Abstract: Training of Convolutional Neural Networks (CNNs) on long video sequences is computationally expensive due to the substantial memory requirements and the massive number of parameters that deep architectures demand. Early fusion of video frames is thus a standard technique, in which several consecutive frames are first agglomerated into a compact representation, and then fed into the CNN as an input sample. For this purpose, a summarization approach that represents a set of consecutive RGB frames by a single dynamic image to capture pixel dynamics was recently proposed. In this paper, we introduce a novel ordered representation of consecutive optical flow frames as an alternative and argue that this representation captures the action dynamics more efficiently than RGB frames. We provide intuitions on why such a representation is better for action recognition. We validate our claims on standard benchmark datasets and demonstrate that using summaries of flow images leads to significant improvements over RGB frames while achieving accuracy comparable to the state-of-the-art on the UCF101 and HMDB datasets.
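One simple order-dependent weighting that approximates rank pooling is sketched below (the exact construction in the paper may differ; the linear weights alpha_t = 2t - T - 1 are a commonly used approximation and are an assumption here):

    import numpy as np

    def ordered_flow_summary(flow_frames):
        """Collapse T consecutive optical-flow frames into a single 2-channel image
        using order-dependent weights, so temporal order is encoded in the summary."""
        T = len(flow_frames)
        weights = np.array([2 * (t + 1) - T - 1 for t in range(T)], dtype=np.float32)
        stack = np.stack(flow_frames).astype(np.float32)   # (T, H, W, 2)
        return np.tensordot(weights, stack, axes=1)        # (H, W, 2)

    # Example: ten random 64x64 flow fields stand in for computed optical flow.
    summary = ordered_flow_summary([np.random.randn(64, 64, 2) for _ in range(10)])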

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A Convolutional Neural Network framework is proposed to model geo-spatial data (specifically housing prices) and learn the spatial correlations automatically, and it is shown that neighborhood information embedded in satellite imagery can be leveraged to achieve the desired spatial smoothing.
Abstract: When modeling geo-spatial data, it is critical to capture spatial correlations for achieving high accuracy. Spatial Auto-Regression (SAR) is a common tool used to model such data, where the spatial contiguity matrix (W) encodes the spatial correlations. However, the efficacy of SAR is limited by two factors. First, it depends on the choice of contiguity matrix, which is typically not learnt from data, but instead is assumed to be known a priori. Second, it assumes that the observations can be explained by linear models. In this paper, we propose a Convolutional Neural Network (CNN) framework to model geo-spatial data (specifically housing prices), to learn the spatial correlations automatically. We show that neighborhood information embedded in satellite imagery can be leveraged to achieve the desired spatial smoothing. An additional upside of our framework is the relaxation of the linear assumption on the data. Specific challenges we tackle while implementing our framework include: (i) how much of the neighborhood is relevant while estimating housing prices? (ii) what is the right approach to capture multiple resolutions of satellite imagery? and (iii) what other data sources can help improve the estimation of spatial correlations? We demonstrate a marked improvement of 57% on top of the SAR baseline through the use of features from deep neural networks for the cities of London, Birmingham and Liverpool.

Proceedings ArticleDOI
Oleksandr Bailo, Seokju Lee, Francois Rameau, Jae Shin Yoon, In So Kweon
24 Mar 2017
TL;DR: A robust approach for road marking detection and recognition from images captured by an embedded camera mounted on a car, designed to cope with illumination changes, shadows, and harsh meteorological conditions is presented.
Abstract: This paper presents a robust approach for road marking detection and recognition from images captured by an embedded camera mounted on a car. Our method is designed to cope with illumination changes, shadows, and harsh meteorological conditions. Furthermore, the algorithm can effectively group complex multi-symbol shapes into an individual road marking. For this purpose, the proposed technique relies on MSER features to obtain candidate regions, which are further merged using density-based clustering. Finally, these regions of interest are recognized using machine learning approaches. Notably, the algorithm is versatile since it does not utilize any prior information about lane position or road space. The proposed method compares favorably to other existing works through a large number of experiments on an extensive road marking dataset.
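A minimal OpenCV/scikit-learn sketch of the candidate-extraction stage described above (thresholds and clustering parameters are placeholders, not the paper's values):

    import cv2
    import numpy as np
    from sklearn.cluster import DBSCAN

    def road_marking_candidates(bgr_image, eps=40.0, min_samples=2):
        """Detect MSER regions and group nearby region centres with DBSCAN so that
        multi-symbol markings end up in a single candidate box."""
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        _, boxes = cv2.MSER_create().detectRegions(gray)
        if len(boxes) == 0:
            return []
        centres = np.array([[x + w / 2.0, y + h / 2.0] for x, y, w, h in boxes])
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centres)
        grouped = []
        for lbl in set(labels) - {-1}:                   # -1 marks unclustered noise
            member = np.asarray(boxes)[labels == lbl]
            x1, y1 = member[:, 0].min(), member[:, 1].min()
            x2 = (member[:, 0] + member[:, 2]).max()
            y2 = (member[:, 1] + member[:, 3]).max()
            grouped.append((int(x1), int(y1), int(x2), int(y2)))
        return grouped

Each grouped box would then be passed to the recognition stage.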

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A deep heterogeneous feature fusion network is proposed to exploit the complementary information present in features generated by different deep convolutional neural networks (DCNNs) for template-based face recognition.
Abstract: Although deep learning has yielded impressive performance for face recognition, many studies have shown that different networks learn different feature maps: while some networks are more receptive to pose and illumination others appear to capture more local information. Thus, in this work, we propose a deep heterogeneous feature fusion network to exploit the complementary information present in features generated by different deep convolutional neural networks (DCNNs) for template-based face recognition, where a template refers to a set of still face images or video frames from different sources which introduces more blur, pose, illumination and other variations than traditional face datasets. The proposed approach efficiently fuses the discriminative information of different deep features by 1) jointly learning the non-linear high-dimensional projection of the deep features and 2) generating a more discriminative template representation which preserves the inherent geometry of the deep features in the feature space. Experimental results on the IARPA Janus Challenge Set 3 (Janus CS3) dataset demonstrate that the proposed method can effectively improve the recognition performance. In addition, we also present a series of covariate experiments on the face verification task for in-depth qualitative evaluations for the proposed approach.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: A computer-vision-based approach to eyelid identification and aperture estimation that runs in real time even on a single core (7ms) and is available, together with the new data set, at http://www.ti.uni-tuebingen.de/Eyelid-detection.html.
Abstract: The correct identification of the eyelids and their aperture provides essential data to infer a subject's mental state (e.g., vigilance, fatigue, and drowsiness) and to validate or reduce the search space of other eye features (e.g., pupil and iris). This knowledge can be used not only to improve many applications, such as eye tracking and iris recognition, but also to derive information about the user (such as the take-over readiness of the driver in the automated driving context). In this paper, we propose a computer-vision-based approach to eyelid identification and aperture estimation. Evaluation was performed on an existing data set from the literature as well as on a new data set introduced in this work. The new data set contains 4000 hand-labeled eye images from 11 subjects driving in a city; these images contain several challenges such as reflections, makeup, wrinkles, blinks, and changing illumination. The proposed method outperformed state-of-the-art methods by up to 16.11 percentage points in terms of average similarity to the hand-labeled eyelid outline (from 34px to 12px) and 21.7 pixels (or 7.53% of the eye image height) in terms of average eyelid aperture estimation error. The proposed method implementation runs in real time even on a single core (7ms) and is available, together with the new data set, at http://www.ti.uni-tuebingen.de/Eyelid-detection.2007.0.html.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This paper proposes to use 3D shape and motion priors to regularize the estimation of the trajectory and the shape of vehicles in sequences of stereo images, representing shapes by 3D signed distance functions and embedding them in a low-dimensional manifold.
Abstract: Inferring the pose and shape of vehicles in 3D from a movable platform still remains a challenging task due to the projective sensing principle of cameras, difficult surface properties, e.g. reflections or transparency, and illumination changes between images. In this paper, we propose to use 3D shape and motion priors to regularize the estimation of the trajectory and the shape of vehicles in sequences of stereo images. We represent shapes by 3D signed distance functions and embed them in a low-dimensional manifold. Our optimization method allows for imposing a common shape across all image observations along an object track. We employ a motion model to regularize the trajectory to plausible object motions. We evaluate our method on the KITTI dataset and show state-of-the-art results in terms of shape reconstruction and pose estimation accuracy.