Showing papers in "IEEE Transactions on Pattern Analysis and Machine Intelligence in 2018"
TL;DR: This work addresses the task of semantic image segmentation with Deep Learning and proposes atrous spatial pyramid pooling (ASPP), which is proposed to robustly segment objects at multiple scales, and improves the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models.
Abstract: In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First , we highlight convolution with upsampled filters, or ‘atrous convolution’, as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second , we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third , we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed “DeepLab” system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7 percent mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
TL;DR: The Places Database is described, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world, using the state-of-the-art Convolutional Neural Networks as baselines, that significantly outperform the previous approaches.
Abstract: The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using the state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines, that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high-coverage and high-diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.
TL;DR: Direct Sparse Odometry (DSO) as mentioned in this paper combines a fully direct probabilistic model with consistent, joint optimization of all model parameters, including geometry represented as inverse depth in a reference frame and camera motion.
Abstract: Direct Sparse Odometry (DSO) is a visual odometry method based on a novel, highly accurate sparse and direct structure and motion formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry-represented as inverse depth in a reference frame-and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on essentially featureless walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.
TL;DR: In this paper, the authors propose a Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities, which performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques.
Abstract: When building a unified vision system or gradually adding new apabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.
TL;DR: In this paper, Long-Term Temporal Convolutional Neural Networks (LTCNNs) were used to learn action representations with high-quality optical flow vector fields and achieved state-of-the-art results on two challenging benchmarks for action recognition.
Abstract: Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).
TL;DR: In this paper, a comprehensive survey of the learning to hash algorithms is presented, categorizing them according to the manners of preserving the similarities into: pairwise similarity preserving, multi-wise similarity preservation, implicit similarity preserving and quantization, and discuss their relations.
Abstract: Nearest neighbor search is a problem of finding the data points from the database such that the distances from them to the query point are the smallest. Learning to hash is one of the major solutions to this problem and has been widely studied recently. In this paper, we present a comprehensive survey of the learning to hash algorithms, categorize them according to the manners of preserving the similarities into: pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, as well as quantization, and discuss their relations. We separate quantization from pairwise similarity preserving as the objective function is very different though quantization, as we show, can be derived from preserving the pairwise similarities. In addition, we present the evaluation protocols, and the general performance analysis, and point out that the quantization algorithms perform superiorly in terms of search accuracy, search time cost, and space cost. Finally, we introduce a few emerging topics.
TL;DR: A convolutional neural network architecture that is trainable in an end-to-end manner directly for the place recognition task, and significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks.
Abstract: We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph We present the following four principal contributions First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the “Vector of Locally Aggregated Descriptors” image representation commonly used in image retrieval The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture's parameters from images depicting the same places over time downloaded from Google Street View Time Machine Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks
TL;DR: A comprehensive survey of instance retrieval over the last decade, presenting milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods.
Abstract: In the early days, content-based image retrieval (CBIR) was studied with global features. Since 2003, image retrieval based on local descriptors ( de facto SIFT) has been extensively studied for over a decade due to the advantage of SIFT in dealing with image transformations. Recently, image representations based on the convolutional neural network (CNN) have attracted increasing interest in the community and demonstrated impressive performance. Given this time of rapid evolution, this article provides a comprehensive survey of instance retrieval over the last decade. Two broad categories, SIFT-based and CNN-based methods, are presented. For the former, according to the codebook size, we organize the literature into using large/medium-sized/small codebooks. For the latter, we discuss three lines of methods, i.e., using pre-trained or fine-tuned CNN models, and hybrid methods. The first two perform a single-pass of an image to the network, while the last category employs a patch-based feature extraction scheme. This survey presents milestones in modern instance retrieval, reviews a broad selection of previous works in different categories, and provides insights on the connection between SIFT and CNN-based methods. After analyzing and comparing retrieval performance of different categories on several datasets, we discuss promising directions towards generic and specialized instance retrieval.
TL;DR: A new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell.
Abstract: Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data. The proposed work extends this idea to spatial domain as well as temporal domain to better analyze the hidden sources of action-related information within the human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.
TL;DR: This paper presents a geodesic distance based technique that provides reliable and temporally consistent saliency measurement of superpixels as a prior for pixel-wise labeling in video saliency estimation.
Abstract: Video saliency, aiming for estimation of a single dominant object in a sequence, offers strong object-level cues for unsupervised video object segmentation. In this paper, we present a geodesic distance based technique that provides reliable and temporally consistent saliency measurement of superpixels as a prior for pixel-wise labeling. Using undirected intra-frame and inter-frame graphs constructed from spatiotemporal edges or appearance and motion, and a skeleton abstraction step to further enhance saliency estimates, our method formulates the pixel-wise segmentation task as an energy minimization problem on a function that consists of unary terms of global foreground and background models, dynamic location models, and pairwise terms of label smoothness potentials. We perform extensive quantitative and qualitative experiments on benchmark datasets. Our method achieves superior performance in comparison to the current state-of-the-art in terms of accuracy and speed.
TL;DR: A Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and patches cropped around facial components, achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces.
Abstract: Human faces in surveillance videos often suffer from severe image blur, dramatic pose variations, and occlusion. In this paper, we propose a comprehensive framework based on Convolutional Neural Networks (CNN) to overcome challenges in video-based face recognition (VFR). First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to account for a shortfall in real-world video training data. Using training data composed of both still images and artificially blurred data, CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance robustness of CNN features to pose variations and occlusion, we propose a Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low- and middle-level convolutional layers between the trunk and branch networks. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. Systematic experiments justify the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces. With the proposed techniques, we also obtain the first place in the BTAS 2016 Video Person Recognition Evaluation.
TL;DR: Through arming the DNN with better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics.
Abstract: In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive efforts have been devoted in recent years, most existing works combined multiple video features using simple fusion strategies and neglected the utilization of inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by imposing regularizations in the learning process of a deep neural network (DNN). Through arming the DNN with better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics. We show that rDNN produces better performance over several state-of-the-art approaches. Competitive results are reported on the well-known Hollywood2 and Columbia Consumer Video benchmarks. In addition, to stimulate future research on large scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.
TL;DR: This work proposes a simple yet effective unsupervised hashing framework, named Similarity-Adaptive Deep Hashing (SADH), which alternatingly proceeds over three training modules: deep hash model training, similarity graph updating and binary code optimization.
Abstract: Recent vision and learning studies show that learning compact hash codes can facilitate massive data processing with significantly reduced storage and computation. Particularly, learning deep hash functions has greatly improved the retrieval performance, typically under the semantic supervision. In contrast, current unsupervised deep hashing algorithms can hardly achieve satisfactory performance due to either the relaxed optimization or absence of similarity-sensitive objective. In this work, we propose a simple yet effective unsupervised hashing framework, named Similarity-Adaptive Deep Hashing (SADH), which alternatingly proceeds over three training modules: deep hash model training, similarity graph updating and binary code optimization. The key difference from the widely-used two-step hashing method is that the output representations of the learned deep model help update the similarity graph matrix, which is then used to improve the subsequent code optimization. In addition, for producing high-quality binary codes, we devise an effective discrete optimization algorithm which can directly handle the binary constraints with a general hashing loss. Extensive experiments validate the efficacy of SADH, which consistently outperforms the state-of-the-arts by large gaps.
TL;DR: A visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions and allows questions to be asked where the image alone does not contain the information required to select the appropriate answer.
Abstract: Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked where the image alone does not contain the information required to select the appropriate answer. Our final model achieves the best reported results for both image captioning and visual question answering on several of the major benchmark datasets.
TL;DR: In this article, a convolutional neural network (CNN) was employed to jointly regress to 3D bounding box coordinates and object pose for object detection and orientation estimation tasks.
Abstract: The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. We then exploit a CNN on top of these proposals to perform object detection. In particular, we employ a convolutional neural net (CNN) that exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we experiment also with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.
TL;DR: This work forms a novel view-specific person re-identification framework from the feature augmentation point of view, called Camera coR relation Aware Feature augmenTation (CRAFT), and presents a domain-generic deep person appearance representation which is designed particularly to be towards view invariant for facilitating cross-view adaptation by CRAFT.
Abstract: The challenge of person re-identification (re-id) is to match individual images of the same person captured by different non-overlapping camera views against significant and unknown cross-view feature distortion. While a large number of distance metric/subspace learning models have been developed for re-id, the cross-view transformations they learned are view-generic and thus potentially less effective in quantifying the feature distortion inherent to each camera view. Learning view-specific feature transformations for re-id (i.e., view-specific re-id), an under-studied approach, becomes an alternative resort for this problem. In this work, we formulate a novel view-specific person re-identification framework from the feature augmentation point of view, called C amera co R relation A ware F eature augmen T ation (CRAFT). Specifically, CRAFT performs cross-view adaptation by automatically measuring camera correlation from cross-view visual data distribution and adaptively conducting feature augmentation to transform the original features into a new adaptive space. Through our augmentation framework, view-generic learning algorithms can be readily generalized to learn and optimize view-specific sub-models whilst simultaneously modelling view-generic discrimination information. Therefore, our framework not only inherits the strength of view-generic model learning but also provides an effective way to take into account view specific characteristics. Our CRAFT framework can be extended to jointly learn view-specific feature transformations for person re-id across a large network with more than two cameras, a largely under-investigated but realistic re-id setting. Additionally, we present a domain-generic deep person appearance representation which is designed particularly to be towards view invariant for facilitating cross-view adaptation by CRAFT. We conducted extensively comparative experiments to validate the superiority and advantages of our proposed framework over state-of-the-art competitors on contemporary challenging person re-id datasets.
TL;DR: The Fact-Based Visual Question Answering (FVQA) dataset as mentioned in this paper ) is a VQA dataset that contains questions that require external information to answer, such as common sense, or basic factual knowledge to answer.
Abstract: Visual Question Answering (VQA) has attracted much attention in both computer vision and natural language processing communities, not least because it offers insight into the relationships between two important sources of information. Current datasets, and the models built upon them, have focused on questions which are answerable by direct analysis of the question and image alone. The set of such questions that require no external information to answer is interesting, but very limited. It excludes questions which require common sense, or basic factual knowledge to answer, for example. Here we introduce FVQA (Fact-based VQA), a VQA dataset which requires, and supports, much deeper reasoning. FVQA primarily contains questions that require external information to answer. We thus extend a conventional visual question answering dataset, which contains image-question-answer triplets, through additional image-question-answer-supporting fact tuples. Each supporting-fact is represented as a structural triplet, such as . We evaluate several baseline models on the FVQA dataset, and describe a novel model which is capable of reasoning about an image on the basis of supporting-facts.
TL;DR: Compared with state-of-the-art approaches, SSDH achieves higher retrieval accuracy, while the classification performance is not sacrificed; yet it is effective and outperforms other hashing approaches on several benchmarks and large datasets.
Abstract: This paper presents a simple yet effective supervised deep hash approach that constructs binary hash codes from labeled data for large-scale image search. We assume that the semantic labels are governed by several latent attributes with each attribute on or off , and classification relies on these attributes. Based on this assumption, our approach, dubbed supervised semantics-preserving deep hashing (SSDH), constructs hash functions as a latent layer in a deep network and the binary codes are learned by minimizing an objective function defined over classification error and other desirable hash codes properties. With this design, SSDH has a nice characteristic that classification and retrieval are unified in a single learning model. Moreover, SSDH performs joint learning of image representations, hash codes, and classification in a point-wised manner, and thus is scalable to large-scale datasets. SSDH is simple and can be realized by a slight enhancement of an existing deep architecture for classification; yet it is effective and outperforms other hashing approaches on several benchmarks and large datasets. Compared with state-of-the-art approaches, SSDH achieves higher retrieval accuracy, while the classification performance is not sacrificed.
TL;DR: The Extreme Value Machine (EVM) is a novel, theoretically sound classifier that has a well-grounded interpretation derived from statistical Extreme Value Theory (EVT), and is the first classifier to be able to perform nonlinear kernel-free variable bandwidth incremental learning.
Abstract: It is often desirable to be able to recognize when inputs to a recognition function learned in a supervised manner correspond to classes unseen at training time. With this ability, new class labels could be assigned to these inputs by a human operator, allowing them to be incorporated into the recognition function—ideally under an efficient incremental update mechanism. While good algorithms that assume inputs from a fixed set of classes exist, e.g. , artificial neural networks and kernel machines, it is not immediately obvious how to extend them to perform incremental learning in the presence of unknown query classes. Existing algorithms take little to no distributional information into account when learning recognition functions and lack a strong theoretical foundation. We address this gap by formulating a novel, theoretically sound classifier—the Extreme Value Machine (EVM). The EVM has a well-grounded interpretation derived from statistical Extreme Value Theory (EVT), and is the first classifier to be able to perform nonlinear kernel-free variable bandwidth incremental learning. Compared to other classifiers in the same deep network derived feature space, the EVM is accurate and efficient on an established benchmark partition of the ImageNet dataset.
TL;DR: This paper proposed Information Dropout, a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. But it does not address the problem of cross-entropy loss, which does not enforce some of the key properties of optimal representations.
Abstract: The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term, which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout, a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Finally, we prove that we can promote the creation of optimal disentangled representations simply by enforcing a factorized prior, a fact that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find that Information Dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
TL;DR: A systematic analysis of these networks shows that the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, and are also effective for other image classification tasks such as texture and scene recognition.
Abstract: We present a simple and effective architecture for fine-grained recognition called Bilinear Convolutional Neural Networks (B-CNNs) . These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs are related to orderless texture representations built on deep features but can be trained in an end-to-end manner. Our most accurate model obtains 84.1, 79.4, 84.5 and 91.3 percent per-image accuracy on the Caltech-UCSD birds  , NABirds  , FGVC aircraft  , and Stanford cars  dataset respectively and runs at 30 frames-per-second on a NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) are also effective for other image classification tasks such as texture and scene recognition, and (3) can be trained from scratch on the ImageNet dataset offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn .
TL;DR: Wang et al. as mentioned in this paper proposed a framework by using the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generator model for drawing (generating) Chinese characters.
Abstract: Recent deep learning based approaches have achieved great success on handwriting recognition. Chinese characters are among the most widely adopted writing systems in the world. Previous research has mainly focused on recognizing handwritten Chinese characters. However, recognition is only one aspect for understanding a language, another challenging and interesting task is to teach a machine to automatically write (pictographic) Chinese characters. In this paper, we propose a framework by using the recurrent neural network (RNN) as both a discriminative model for recognizing Chinese characters and a generative model for drawing (generating) Chinese characters. To recognize Chinese characters, previous methods usually adopt the convolutional neural network (CNN) models which require transforming the online handwriting trajectory into image-like representations. Instead, our RNN based approach is an end-to-end system which directly deals with the sequential structure and does not require any domain-specific knowledge. With the RNN system (combining an LSTM and GRU), state-of-the-art performance can be achieved on the ICDAR-2013 competition database. Furthermore, under the RNN framework, a conditional generative model with character embedding is proposed for automatically drawing recognizable Chinese characters. The generated characters (in vector format) are human-readable and also can be recognized by the discriminative RNN model with high accuracy. Experimental results verify the effectiveness of using RNNs as both generative and discriminative models for the tasks of drawing and recognizing Chinese characters.
TL;DR: This paper defines the tracklet confidence using the detectability and continuity of a tracklet, and decomposes a multi-object tracking problem into small subproblems based on theTracklet confidence, and solves the online multi- object tracking problem by associating tracklets and detections in different ways according to their confidence values.
Abstract: Online multi-object tracking aims at estimating the tracks of multiple objects instantly with each incoming frame and the information provided up to the moment. It still remains a difficult problem in complex scenes, because of the large ambiguity in associating multiple objects in consecutive frames and the low discriminability between objects appearances. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first define the tracklet confidence using the detectability and continuity of a tracklet, and decompose a multi-object tracking problem into small subproblems based on the tracklet confidence. We then solve the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive association steps. For more reliable association between tracklets and detections, we also propose a deep appearance learning method to learn a discriminative appearance model from large training datasets, since the conventional appearance learning methods do not provide rich representation that can distinguish multiple objects with large appearance variations. In addition, we combine online transfer learning for improving appearance discriminability by adapting the pre-trained deep model during online tracking. Experiments with challenging public datasets show distinct performance improvement over other state-of-the-arts batch and online tracking methods, and prove the effect and usefulness of the proposed methods for online multi-object tracking.
TL;DR: Zhang et al. as discussed by the authors proposed a deep multi-task learning (DMTL) approach to jointly estimate multiple heterogeneous attributes from a single face image, which tackles attribute correlation and heterogeneity with convolutional neural networks (CNNs).
Abstract: Face attribute estimation has many potential applications in video surveillance, face retrieval, and social media. While a number of methods have been proposed for face attribute estimation, most of them did not explicitly consider the attribute correlation and heterogeneity (e.g., ordinal versus nominal and holistic versus local) during feature representation learning. In this paper, we present a Deep Multi-Task Learning (DMTL) approach to jointly estimate multiple heterogeneous attributes from a single face image. In DMTL, we tackle attribute correlation and heterogeneity with convolutional neural networks (CNNs) consisting of shared feature learning for all the attributes, and category-specific feature learning for heterogeneous attributes. We also introduce an unconstrained face database (LFW+), an extension of public-domain LFW, with heterogeneous demographic attributes (age, gender, and race) obtained via crowdsourcing. Experimental results on benchmarks with multiple face attributes (MORPH II, LFW+, CelebA, LFWA, and FotW) show that the proposed approach has superior performance compared to state of the art. Finally, evaluations on a public-domain face database (LAP) with a single attribute show that the proposed approach has excellent generalization ability.
TL;DR: This paper proposes a new learning-based hashing method called "fast supervised discrete hashing" (FSDH) based on “supervised discrete hashing” (SDH), which uses a very simple yet effective regression of the class labels of training examples to the corresponding hash code to accelerate the algorithm.
Abstract: Learning-based hashing algorithms are “hot topics” because they can greatly increase the scale at which existing methods operate. In this paper, we propose a new learning-based hashing method called “fast supervised discrete hashing” (FSDH) based on “supervised discrete hashing” (SDH). Regressing the training examples (or hash code) to the corresponding class labels is widely used in ordinary least squares regression. Rather than adopting this method, FSDH uses a very simple yet effective regression of the class labels of training examples to the corresponding hash code to accelerate the algorithm. To the best of our knowledge, this strategy has not previously been used for hashing. Traditional SDH decomposes the optimization into three sub-problems, with the most critical sub-problem - discrete optimization for binary hash codes - solved using iterative discrete cyclic coordinate descent (DCC), which is time-consuming. However, FSDH has a closed-form solution and only requires a single rather than iterative hash code-solving step, which is highly efficient. Furthermore, FSDH is usually faster than SDH for solving the projection matrix for least squares regression, making FSDH generally faster than SDH. For example, our results show that FSDH is about 12-times faster than SDH when the number of hashing bits is 128 on the CIFAR-10 data base, and FSDH is about 151-times faster than FastHash when the number of hashing bits is 64 on the MNIST data-base. Our experimental results show that FSDH is not only fast, but also outperforms other comparative methods.
TL;DR: A measure for tensor sparsity, called Kronecker-basis-representation based tensor Sparsity measure (KBR briefly), is proposed, which encodes both sparsity insights delivered by Tucker and CANDECOMP/PARAFAC low-rank decompositions for a general tensor.
Abstract: As a promising way for analyzing data, sparse modeling has achieved great success throughout science and engineering. It is well known that the sparsity/low-rank of a vector/matrix can be rationally measured by nonzero-entries-number ( $l_0$ norm)/nonzero- singular-values-number (rank), respectively. However, data from real applications are often generated by the interaction of multiple factors, which obviously cannot be sufficiently represented by a vector/matrix, while a high order tensor is expected to provide more faithful representation to deliver the intrinsic structure underlying such data ensembles. Unlike the vector/matrix case, constructing a rational high order sparsity measure for tensor is a relatively harder task. To this aim, in this paper we propose a measure for tensor sparsity, called Kronecker-basis-representation based tensor sparsity measure (KBR briefly), which encodes both sparsity insights delivered by Tucker and CANDECOMP/PARAFAC (CP) low-rank decompositions for a general tensor. Then we study the KBR regularization minimization (KBRM) problem, and design an effective ADMM algorithm for solving it, where each involved parameter can be updated with closed-form equations. Such an efficient solver makes it possible to extend KBR to various tasks like tensor completion and tensor robust principal component analysis. A series of experiments, including multispectral image (MSI) denoising, MSI completion and background subtraction, substantiate the superiority of the proposed methods beyond state-of-the-arts.
TL;DR: A novel four stream CNN architecture is introduced which can learn from RGB and optical flow frames as well as from their dynamic image representations, and achieves state-of-the-art performance in the UCF101 and HMDB51.
Abstract: We introduce the concept of dynamic image , a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of ‘rank pooling’. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. We call the resulting representation dynamic image because it summarizes the video dynamics in addition to appearance. This powerful idea allows to convert any video to an image so that existing CNN models pre-trained with still images can be immediately extended to videos. We also present an efficient approximate rank pooling operator that runs two orders of magnitude faster than the standard ones with any loss in ranking performance and can be formulated as a CNN layer. To demonstrate the power of the representation, we introduce a novel four stream CNN architecture which can learn from RGB and optical flow frames as well as from their dynamic image representations. We show that the proposed network achieves state-of-the-art performance, 95.5 and 72.5 percent accuracy, in the UCF101 and HMDB51, respectively.
TL;DR: A Robust Non-Linear Knowledge Transfer Model (R-NKTM) is a deep fully-connected neural network that transfers knowledge of human actions from any unknown view to a shared high-level virtual view by finding a set of non-linear transformations that connects the views.
Abstract: Recognizing human actions from unknown and unseen (novel) views is a challenging problem. We propose a Robust Non-Linear Knowledge Transfer Model (R-NKTM) for human action recognition from novel views. The proposed R-NKTM is a deep fully-connected neural network that transfers knowledge of human actions from any unknown view to a shared high-level virtual view by finding a set of non-linear transformations that connects the views. The R-NKTM is learned from 2D projections of dense trajectories of synthetic 3D human models fitted to real motion capture data and generalizes to real videos of human actions. The strength of our technique is that we learn a single R-NKTM for all actions and all viewpoints for knowledge transfer of any real human action video without the need for re-training or fine-tuning the model. Thus, R-NKTM can efficiently scale to incorporate new action classes. R-NKTM is learned with dummy labels and does not require knowledge of the camera viewpoint at any stage. Experiments on three benchmark cross-view human action datasets show that our method outperforms existing state-of-the-art.
TL;DR: A new deep autoencoder based shared-specific feature factorization network to separate input multimodal signals into a hierarchy of components and a structured sparsity learning machine is proposed which utilizes mixed norms to apply regularization within components and group selection between them for better classification performance.
Abstract: Single modality action recognition on RGB or depth sequences has been extensively explored recently. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition. Therefore, analysis of the RGB+D videos can help us to better study the complementary properties of these two types of modalities and achieve higher levels of performance. In this paper, we propose a new deep autoencoder based shared-specific feature factorization network to separate input multimodal signals into a hierarchy of components. Further, based on the structure of the features, a structured sparsity learning machine is proposed which utilizes mixed norms to apply regularization within components and group selection between them for better classification performance. Our experimental results show the effectiveness of our cross-modality feature analysis framework by achieving state-of-the-art accuracy for action classification on five challenging benchmark datasets.
TL;DR: An effective blind image deblurring algorithm based on the dark channel prior that achieves the state-of-the-art results on natural images and performs favorably against methods designed for specific scenarios and can be applied to image dehazing.
Abstract: We present an effective blind image deblurring algorithm based on the dark channel prior. The motivation of this work is an interesting observation that the dark channel of blurred images is less sparse. While most patches in a clean image contain some dark pixels, this is not the case when they are averaged with neighboring ones by motion blur. This change in sparsity of the dark channel pixels is an inherent property of the motion blur process, which we prove mathematically and validate using image data. Enforcing sparsity of the dark channel thus helps blind deblurring in various scenarios such as natural, face, text, and low-illumination images. However, imposing sparsity of the dark channel introduces a non-convex non-linear optimization problem. In this work, we introduce a linear approximation to address this issue. Extensive experiments demonstrate that the proposed deblurring algorithm achieves the state-of-the-art results on natural images and performs favorably against methods designed for specific scenarios. In addition, we show that the proposed method can be applied to image dehazing.