scispace - formally typeset
Search or ask a question

Showing papers on "Feature vector published in 2017"


Posted Content
TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

2,248 citations


Proceedings ArticleDOI
07 Aug 2017
TL;DR: Neural Factorization Machines (NFM) as discussed by the authors is a special case of NFM without hidden layers, which combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural network in modelling higher-order features.
Abstract: Many predictive tasks of web applications need to model categorical variables, such as user IDs and demographics like genders and occupations. To apply standard machine learning techniques, these categorical predictors are always converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse. To learn from such sparse data effectively, it is crucial to account for the interactions between features. Factorization Machines (FMs) are a popular solution for efficiently using the second-order feature interactions. However, FM models feature interactions in a linear way, which can be insufficient for capturing the non-linear and complex inherent structure of real-world data. While deep neural networks have recently been applied to learn non-linear feature interactions in industry, such as the Wide&Deep by Google and DeepCross by Microsoft, the deep structure meanwhile makes them difficult to train. In this paper, we propose a novel model Neural Factorization Machine (NFM) for prediction under sparse settings. NFM seamlessly combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural network in modelling higher-order feature interactions. Conceptually, NFM is more expressive than FM since FM can be seen as a special case of NFM without hidden layers. Empirical results on two regression tasks show that with one hidden layer only, NFM significantly outperforms FM with a 7.3% relative improvement. Compared to the recent deep learning methods Wide&Deep and DeepCross, our NFM uses a shallower structure but offers better performance, being much easier to train and tune in practice.

910 citations


Journal ArticleDOI
TL;DR: The experimental results confirm the efficiency of the proposed approaches in improving the classification accuracy compared to other wrapper-based algorithms, which insures the ability of WOA algorithm in searching the feature space and selecting the most informative attributes for classification tasks.

853 citations


Proceedings ArticleDOI
TL;DR: Two feature squeezing methods are explored: reducing the color bit depth of each pixel and spatial smoothing, which are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.
Abstract: Although deep neural networks (DNNs) have achieved great success in many tasks, they can often be fooled by \emph{adversarial examples} that are generated by adding small but purposeful distortions to natural examples. Previous studies to defend against adversarial examples mostly focused on refining the DNN models, but have either shown limited success or required expensive computation. We propose a new strategy, \emph{feature squeezing}, that can be used to harden DNN models by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different feature vectors in the original space into a single sample. By comparing a DNN model's prediction on the original input with that on squeezed inputs, feature squeezing detects adversarial examples with high accuracy and few false positives. This paper explores two feature squeezing methods: reducing the color bit depth of each pixel and spatial smoothing. These simple strategies are inexpensive and complementary to other defenses, and can be combined in a joint detection framework to achieve high detection rates against state-of-the-art attacks.

696 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: In this paper, an encoder aims to project a visual feature vector into the semantic space as in the existing ZSL models, but the decoder exerts an additional constraint, that the projection/code must be able to reconstruct the original visual feature.
Abstract: Existing zero-shot learning (ZSL) models typically learn a projection function from a feature space to a semantic embedding space (e.g. attribute space). However, such a projection function is only concerned with predicting the training seen class semantic representation (e.g. attribute prediction) or classification. When applied to test data, which in the context of ZSL contains different (unseen) classes without training data, a ZSL model typically suffers from the project domain shift problem. In this work, we present a novel solution to ZSL based on learning a Semantic AutoEncoder (SAE). Taking the encoder-decoder paradigm, an encoder aims to project a visual feature vector into the semantic space as in the existing ZSL models. However, the decoder exerts an additional constraint, that is, the projection/code must be able to reconstruct the original visual feature. We show that with this additional reconstruction constraint, the learned projection function from the seen classes is able to generalise better to the new unseen classes. Importantly, the encoder and decoder are linear and symmetric which enable us to develop an extremely efficient learning algorithm. Extensive experiments on six benchmark datasets demonstrate that the proposed SAE outperforms significantly the existing ZSL models with the additional benefit of lower computational cost. Furthermore, when the SAE is applied to supervised clustering problem, it also beats the state-of-the-art.

645 citations


Journal ArticleDOI
TL;DR: Experimental results demonstrated that the proposed SAE-DBN approach can effectively identify the machine running conditions and significantly outperform other fusion methods.
Abstract: To assess health conditions of rotating machinery efficiently, multiple accelerometers are mounted on different locations to acquire a variety of possible faults signals. The statistical features are extracted from these signals to identify the running status of a machine. However, the acquired vibration signals are different due to sensor’s arrangement and environmental interference, which may lead to different diagnostic results. In order to improve the fault diagnosis reliability, a new multisensor data fusion technique is proposed. First, time-domain and frequency-domain features are extracted from the different sensor signals, and then these features are input into multiple two-layer sparse autoencoder (SAE) neural networks for feature fusion. Finally, fused feature vectors can be regarded as the machine health indicators, and be used to train deep belief network (DBN) for further classification. To verify the effectiveness of the proposed SAE-DBN scheme, the bearing fault experiments were conducted on a bearing test platform, and the vibration data sets under different running speeds were collected for algorithm validation. For comparison, different feature fusion methods were also applied to multisensor fusion in the experiments. Experimental results demonstrated that the proposed approach can effectively identify the machine running conditions and significantly outperform other fusion methods.

632 citations


Posted Content
TL;DR: This work proposes a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs that achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.
Abstract: Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain as the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and are competitive with state-of-the-art graph kernels.

469 citations


Posted Content
TL;DR: This work presents a novel solution to ZSL based on learning a Semantic AutoEncoder (SAE), which outperforms significantly the existing ZSL models with the additional benefit of lower computational cost and beats the state-of-the-art when the SAE is applied to supervised clustering problem.
Abstract: Existing zero-shot learning (ZSL) models typically learn a projection function from a feature space to a semantic embedding space (e.g.~attribute space). However, such a projection function is only concerned with predicting the training seen class semantic representation (e.g.~attribute prediction) or classification. When applied to test data, which in the context of ZSL contains different (unseen) classes without training data, a ZSL model typically suffers from the project domain shift problem. In this work, we present a novel solution to ZSL based on learning a Semantic AutoEncoder (SAE). Taking the encoder-decoder paradigm, an encoder aims to project a visual feature vector into the semantic space as in the existing ZSL models. However, the decoder exerts an additional constraint, that is, the projection/code must be able to reconstruct the original visual feature. We show that with this additional reconstruction constraint, the learned projection function from the seen classes is able to generalise better to the new unseen classes. Importantly, the encoder and decoder are linear and symmetric which enable us to develop an extremely efficient learning algorithm. Extensive experiments on six benchmark datasets demonstrate that the proposed SAE outperforms significantly the existing ZSL models with the additional benefit of lower computational cost. Furthermore, when the SAE is applied to supervised clustering problem, it also beats the state-of-the-art.

455 citations


Proceedings ArticleDOI
21 Jul 2017
TL;DR: This paper used a CNN to regress 3DMM shape and texture parameters directly from an input photo and achieved state-of-the-art results on the LFW, YTF and IJB-A benchmarks.
Abstract: The 3D shapes of faces are well known to be discriminative. Yet despite this, they are rarely used for face recognition and always under controlled viewing conditions. We claim that this is a symptom of a serious but often overlooked problem with existing methods for single view 3D face reconstruction: when applied in the wild, their 3D estimates are either unstable and change for different photos of the same subject or they are over-regularized and generic. In response, we describe a robust method for regressing discriminative 3D morphable face models (3DMM). We use a convolutional neural network (CNN) to regress 3DMM shape and texture parameters directly from an input photo. We overcome the shortage of training data required for this purpose by offering a method for generating huge numbers of labeled examples. The 3D estimates produced by our CNN surpass state of the art accuracy on the MICC data set. Coupled with a 3D-3D face matching pipeline, we show the first competitive face recognition results on the LFW, YTF and IJB-A benchmarks using 3D face shapes as representations, rather than the opaque deep feature vectors used by other modern systems.

451 citations


Posted Content
TL;DR: This work proposes an approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step.
Abstract: Semantic instance segmentation remains a challenging task. In this work we propose to tackle the problem with a discriminative loss function, operating at the pixel level, that encourages a convolutional network to produce a representation of the image that can easily be clustered into instances with a simple post-processing step. The loss function encourages the network to map each pixel to a point in feature space so that pixels belonging to the same instance lie close together while different instances are separated by a wide margin. Our approach of combining an off-the-shelf network with a principled loss function inspired by a metric learning objective is conceptually simple and distinct from recent efforts in instance segmentation. In contrast to previous works, our method does not rely on object proposals or recurrent mechanisms. A key contribution of our work is to demonstrate that such a simple setup without bells and whistles is effective and can perform on par with more complex methods. Moreover, we show that it does not suffer from some of the limitations of the popular detect-and-segment approaches. We achieve competitive performance on the Cityscapes and CVPPP leaf segmentation benchmarks.

449 citations


Journal ArticleDOI
Chao Du1, Fuxi Cai1, Mohammed A. Zidan1, Wen Ma1, Seung Hwan Lee1, Wei Lu1 
TL;DR: It is shown that the internal ionic dynamic processes of memristors allow the memristor-based reservoir to directly process information in the temporal domain, and it is demonstrated that even a small hardware system with only 88memristors can already be used for tasks, such as handwritten digit recognition.
Abstract: Reservoir computing systems utilize dynamic reservoirs having short-term memory to project features from the temporal inputs into a high-dimensional feature space. A readout function layer can then effectively analyze the projected features for tasks, such as classification and time-series analysis. The system can efficiently compute complex and temporal data with low-training cost, since only the readout function needs to be trained. Here we experimentally implement a reservoir computing system using a dynamic memristor array. We show that the internal ionic dynamic processes of memristors allow the memristor-based reservoir to directly process information in the temporal domain, and demonstrate that even a small hardware system with only 88 memristors can already be used for tasks, such as handwritten digit recognition. The system is also used to experimentally solve a second-order nonlinear task, and can successfully predict the expected output without knowing the form of the original dynamic transfer function. Reservoir computing facilitates the projection of temporal input signals onto a high-dimensional feature space via a dynamic system, known as the reservoir. Du et al. realise this concept using metal-oxide-based memristors with short-term memory to perform digit recognition tasks and solve non-linear problems.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: A semi-supervised framework is proposed – based on Generative Adversarial Networks (GANs) – which consists of a generator network to provide extra training examples to a multi-class classifier, acting as discriminator in the GAN framework, that assigns sample a label y from the K possible classes or marks it as a fake sample (extra class).
Abstract: Semantic segmentation has been a long standing challenging task in computer vision. It aims at assigning a label to each image pixel and needs a significant number of pixel-level annotated data, which is often unavailable. To address this lack of annotations, in this paper, we leverage, on one hand, a massive amount of available unlabeled or weakly labeled data, and on the other hand, non-real images created through Generative Adversarial Networks. In particular, we propose a semi-supervised framework – based on Generative Adversarial Networks (GANs) – which consists of a generator network to provide extra training examples to a multi-class classifier, acting as discriminator in the GAN framework, that assigns sample a label y from the K possible classes or marks it as a fake sample (extra class). The underlying idea is that adding large fake visual data forces real samples to be close in the feature space, which, in turn, improves multiclass pixel classification. To ensure a higher quality of generated images by GANs with consequently improved pixel classification, we extend the above framework by adding weakly annotated data, i.e., we provide class level information to the generator. We test our approaches on several challenging benchmarking visual datasets, i.e. PASCAL, SiftFLow, Stanford and CamVid, achieving competitive performance compared to state-of-the-art semantic segmentation methods.

Posted Content
TL;DR: Neural Factorization Machines (NFM) as mentioned in this paper is a special case of NFM without hidden layers, which combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural network in modelling higher order feature interactions.
Abstract: Many predictive tasks of web applications need to model categorical variables, such as user IDs and demographics like genders and occupations. To apply standard machine learning techniques, these categorical predictors are always converted to a set of binary features via one-hot encoding, making the resultant feature vector highly sparse. To learn from such sparse data effectively, it is crucial to account for the interactions between features. Factorization Machines (FMs) are a popular solution for efficiently using the second-order feature interactions. However, FM models feature interactions in a linear way, which can be insufficient for capturing the non-linear and complex inherent structure of real-world data. While deep neural networks have recently been applied to learn non-linear feature interactions in industry, such as the Wide&Deep by Google and DeepCross by Microsoft, the deep structure meanwhile makes them difficult to train. In this paper, we propose a novel model Neural Factorization Machine (NFM) for prediction under sparse settings. NFM seamlessly combines the linearity of FM in modelling second-order feature interactions and the non-linearity of neural network in modelling higher-order feature interactions. Conceptually, NFM is more expressive than FM since FM can be seen as a special case of NFM without hidden layers. Empirical results on two regression tasks show that with one hidden layer only, NFM significantly outperforms FM with a 7.3% relative improvement. Compared to the recent deep learning methods Wide&Deep and DeepCross, our NFM uses a shallower structure but offers better performance, being much easier to train and tune in practice.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes to build graphs over the scene objects and over the question words, and describes a deep neural network that exploits the structure in these representations, and achieves significant improvements over the state-of-the-art.
Abstract: This paper proposes to improve visual question answering (VQA) with structured representations of both scene contents and questions. A key challenge in VQA is to require joint reasoning over the visual and text domains. The predominant CNN/LSTM-based approach to VQA is limited by monolithic vector representations that largely ignore structure in the scene and in the question. CNN feature vectors cannot effectively capture situations as simple as multiple object instances, and LSTMs process questions as series of words, which do not reflect the true complexity of language structure. We instead propose to build graphs over the scene objects and over the question words, and we describe a deep neural network that exploits the structure in these representations. We show that this approach achieves significant improvements over the state-of-the-art, increasing accuracy from 71.2% to 74.4% in accuracy on the abstract scenes multiple-choice benchmark, and from 34.7% to 39.1% in accuracy over pairs of balanced scenes, i.e. images with fine-grained differences and opposite yes/no answers to a same question.

Journal ArticleDOI
Yang Zhan1, Kun Fu1, Menglong Yan1, Xian Sun1, Hongqi Wang1, Xiaosong Qiu1 
TL;DR: A novel supervised change detection method based on a deep siamese convolutional network for optical aerial images that is comparable, even better, with the two state-of-the-art methods in terms of F-measure.
Abstract: In this letter, we propose a novel supervised change detection method based on a deep siamese convolutional network for optical aerial images. We train a siamese convolutional network using the weighted contrastive loss. The novelty of the method is that the siamese network is learned to extract features directly from the image pairs. Compared with hand-crafted features used by the conventional change detection method, the extracted features are more abstract and robust. Furthermore, because of the advantage of the weighted contrastive loss function, the features have a unique property: the feature vectors of the changed pixel pair are far away from each other, while the ones of the unchanged pixel pair are close. Therefore, we use the distance of the feature vectors to detect changes between the image pair. Simple threshold segmentation on the distance map can even obtain good performance. For improvement, we use a $k$ -nearest neighbor approach to update the initial result. Experimental results show that the proposed method produces results comparable, even better, with the two state-of-the-art methods in terms of F-measure.

Book ChapterDOI
14 Nov 2017
TL;DR: A convolutional autoencoders structure is developed to learn embedded features in an end-to-end way and a clustering oriented loss is directly built on embedded features to jointly perform feature refinement and cluster assignment.
Abstract: Deep clustering utilizes deep neural networks to learn feature representation that is suitable for clustering tasks. Though demonstrating promising performance in various applications, we observe that existing deep clustering algorithms either do not well take advantage of convolutional neural networks or do not considerably preserve the local structure of data generating distribution in the learned feature space. To address this issue, we propose a deep convolutional embedded clustering algorithm in this paper. Specifically, we develop a convolutional autoencoders structure to learn embedded features in an end-to-end way. Then, a clustering oriented loss is directly built on embedded features to jointly perform feature refinement and cluster assignment. To avoid feature space being distorted by the clustering loss, we keep the decoder remained which can preserve local structure of data in feature space. In sum, we simultaneously minimize the reconstruction loss of convolutional autoencoders and the clustering loss. The resultant optimization problem can be effectively solved by mini-batch stochastic gradient descent and back-propagation. Experiments on benchmark datasets empirically validate the power of convolutional autoencoders for feature learning and the effectiveness of local structure preservation.

Journal ArticleDOI
TL;DR: A generic computer vision system designed for exploiting trained deep Convolutional Neural Networks as a generic feature extractor and mixing these features with more traditional hand-crafted features is presented, demonstrating the generalizability of the proposed approach.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: Methods, systems, and computer-readable storage media for receiving a vocabulary, the vocabulary including text data that is provided as at least a portion of raw data, and the trained neural attention model being used to automatically determine aspects from the vocabulary.
Abstract: Methods, systems, and computer-readable storage media for receiving a vocabulary, the vocabulary including text data that is provided as at least a portion of raw data, the raw data being provided in a computer-readable file, associating each word in the vocabulary with a feature vector, providing a sentence embedding for each sentence of the vocabulary based on a plurality of feature vectors to provide a plurality of sentence embeddings, providing a reconstructed sentence embedding for each sentence embedding based on a weighted parameter matrix to provide a plurality of reconstructed sentence embeddings, and training the unsupervised neural attention model based on the sentence embeddings and the reconstructed sentence embeddings to provide a trained neural attention model, the trained neural attention model being used to automatically determine aspects from the vocabulary.

Posted Content
25 Jul 2017
TL;DR: A combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of the method to VQA.
Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed the selective convolutional descriptor aggregation (SCDA) method, which first localizes the main object in fine-grained images and then aggregates the selected descriptors into a short feature vector using the best practices.
Abstract: Deep convolutional neural network models pre-trained for the ImageNet classification task have been successfully adopted to tasks in other domains, such as texture description and object proposal generation, but these tasks require annotations for images in the new domain. In this paper, we focus on a novel and challenging task in the pure unsupervised setting: fine-grained image retrieval. Even with image labels, fine-grained images are difficult to classify, letting alone the unsupervised retrieval task. We propose the selective convolutional descriptor aggregation (SCDA) method. The SCDA first localizes the main object in fine-grained images, a step that discards the noisy background and keeps useful deep descriptors. The selected descriptors are then aggregated and the dimensionality is reduced into a short feature vector using the best practices we found. The SCDA is unsupervised, using no image label or bounding box annotation. Experiments on six fine-grained data sets confirm the effectiveness of the SCDA for fine-grained image retrieval. Besides, visualization of the SCDA features shows that they correspond to visual attributes (even subtle ones), which might explain SCDA’s high-mean average precision in fine-grained retrieval. Moreover, on general image retrieval data sets, the SCDA achieves comparable retrieval results with the state-of-the-art general image retrieval approaches.

Journal ArticleDOI
TL;DR: A new dataset of acceleration samples acquired with an Android smartphone designed for human activity recognition and fall detection is presented and shows that the presence of samples of the same subject both in the training and in the test datasets, increases the performance of the classifiers regardless of the feature vector used.
Abstract: Smartphones, smartwatches, fitness trackers, and ad-hoc wearable devices are being increasingly used to monitor human activities. Data acquired by the hosted sensors are usually processed by machine-learning-based algorithms to classify human activities. The success of those algorithms mostly depends on the availability of training (labeled) data that, if made publicly available, would allow researchers to make objective comparisons between techniques. Nowadays, there are only a few publicly available data sets, which often contain samples from subjects with too similar characteristics, and very often lack specific information so that is not possible to select subsets of samples according to specific criteria. In this article, we present a new dataset of acceleration samples acquired with an Android smartphone designed for human activity recognition and fall detection. The dataset includes 11,771 samples of both human activities and falls performed by 30 subjects of ages ranging from 18 to 60 years. Samples are divided in 17 fine grained classes grouped in two coarse grained classes: one containing samples of 9 types of activities of daily living (ADL) and the other containing samples of 8 types of falls. The dataset has been stored to include all the information useful to select samples according to different criteria, such as the type of ADL performed, the age, the gender, and so on. Finally, the dataset has been benchmarked with four different classifiers and with two different feature vectors. We evaluated four different classification tasks: fall vs. no fall, 9 activities, 8 falls, 17 activities and falls. For each classification task, we performed a 5-fold cross-validation (i.e., including samples from all the subjects in both the training and the test dataset) and a leave-one-subject-out cross-validation (i.e., the test data include the samples of a subject only, and the training data, the samples of all the other subjects). Regarding the classification tasks, the major findings can be summarized as follows: (i) it is quite easy to distinguish between falls and ADLs, regardless of the classifier and the feature vector selected. Indeed, these classes of activities present quite different acceleration shapes that simplify the recognition task; (ii) on average, it is more difficult to distinguish between types of falls than between types of activities, regardless of the classifier and the feature vector selected. This is due to the similarity between the acceleration shapes of different kinds of falls. On the contrary, ADLs acceleration shapes present differences except for a small group. Finally, the evaluation shows that the presence of samples of the same subject both in the training and in the test datasets, increases the performance of the classifiers regardless of the feature vector used. This happens because each human subject differs from other subjects in performing activities even if she shares with them the same physical characteristics.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This NAN is trained with a standard classification or verification loss without any extra supervision signal, and it is found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces.
Abstract: This paper presents a Neural Aggregation Network (NAN) for video face recognition. The network takes a face video or face image set of a person with a variable number of face images as its input, and produces a compact, fixed-dimension feature representation for recognition. The whole network is composed of two modules. The feature embedding module is a deep Convolutional Neural Network (CNN) which maps each face image to a feature vector. The aggregation module consists of two attention blocks which adaptively aggregate the feature vectors to form a single feature inside the convex hull spanned by them. Due to the attention mechanism, the aggregation is invariant to the image order. Our NAN is trained with a standard classification or verification loss without any extra supervision signal, and we found that it automatically learns to advocate high-quality face images while repelling low-quality ones such as blurred, occluded and improperly exposed faces. The experiments on IJB-A, YouTube Face, Celebrity-1000 video face recognition benchmarks show that it consistently outperforms naive aggregation methods and achieves the state-of-the-art accuracy.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: Experimental results show the proposed CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes outperforms existing deep architectures, and can localize images in hard conditions, where classic SIFT-based methods fail.
Abstract: In this work we propose a new CNN+LSTM architecture for camera pose regression for indoor and outdoor scenes. CNNs allow us to learn suitable feature representations for localization that are robust against motion blur and illumination changes. We make use of LSTM units on the CNN output, which play the role of a structured dimensionality reduction on the feature vector, leading to drastic improvements in localization performance. We provide extensive quantitative comparison of CNN-based and SIFT-based localization methods, showing the weaknesses and strengths of each. Furthermore, we present a new large-scale indoor dataset with accurate ground truth from a laser scanner. Experimental results on both indoor and outdoor public datasets show our method outperforms existing deep architectures, and can localize images in hard conditions, e.g., in the presence of mostly textureless surfaces, where classic SIFT-based methods fail.

Journal ArticleDOI
TL;DR: A novel structured matrix decomposition model with two structural regularizations that captures the image structure and enforces patches from the same object to have similar saliency values, and a Laplacian regularization that enlarges the gaps between salient objects and the background in feature space is proposed.
Abstract: Low-rank recovery models have shown potential for salient object detection, where a matrix is decomposed into a low-rank matrix representing image background and a sparse matrix identifying salient objects. Two deficiencies, however, still exist. First, previous work typically assumes the elements in the sparse matrix are mutually independent, ignoring the spatial and pattern relations of image regions. Second, when the low-rank and sparse matrices are relatively coherent, e.g., when there are similarities between the salient objects and background or when the background is complicated, it is difficult for previous models to disentangle them. To address these problems, we propose a novel structured matrix decomposition model with two structural regularizations: (1) a tree-structured sparsity-inducing regularization that captures the image structure and enforces patches from the same object to have similar saliency values, and (2) a Laplacian regularization that enlarges the gaps between salient objects and the background in feature space. Furthermore, high-level priors are integrated to guide the matrix decomposition and boost the detection. We evaluate our model for salient object detection on five challenging datasets including single object, multiple objects and complex scene images, and show competitive results as compared with 24 state-of-the-art methods in terms of seven performance metrics.

Journal ArticleDOI
TL;DR: A convolutional neural network is employed for the embedding task of learning an expressive molecular representation by treating molecules as undirected graphs with attributed nodes and edges, and preserves molecule-level spatial information that significantly enhances model performance.
Abstract: The task of learning an expressive molecular representation is central to developing quantitative structure–activity and property relationships. Traditional approaches rely on group additivity rules, empirical measurements or parameters, or generation of thousands of descriptors. In this paper, we employ a convolutional neural network for this embedding task by treating molecules as undirected graphs with attributed nodes and edges. Simple atom and bond attributes are used to construct atom-specific feature vectors that take into account the local chemical environment using different neighborhood radii. By working directly with the full molecular graph, there is a greater opportunity for models to identify important features relevant to a prediction task. Unlike other graph-based approaches, our atom featurization preserves molecule-level spatial information that significantly enhances model performance. Our models learn to identify important features of atom clusters for the prediction of aqueous solubil...

Posted Content
TL;DR: This paper adopts a simpler, domain-agnostic approach to dataset augmentation, and works in the space of context vectors generated by sequence-to-sequence models, demonstrating a technique that is effective for both static and sequential data.
Abstract: Dataset augmentation, the practice of applying a wide array of domain-specific transformations to synthetically expand a training set, is a standard tool in supervised learning. While effective in tasks such as visual recognition, the set of transformations must be carefully designed, implemented, and tested for every new domain, limiting its re-use and generality. In this paper, we adopt a simpler, domain-agnostic approach to dataset augmentation. We start with existing data points and apply simple transformations such as adding noise, interpolating, or extrapolating between them. Our main insight is to perform the transformation not in input space, but in a learned feature space. A re-kindling of interest in unsupervised representation learning makes this technique timely and more effective. It is a simple proposal, but to-date one that has not been tested empirically. Working in the space of context vectors generated by sequence-to-sequence models, we demonstrate a technique that is effective for both static and sequential data.

Posted Content
TL;DR: In this paper, a multi-branched reconstruction network is proposed to disentangle and encode the three image factors into embedding features, which are then combined to re-compose the input image itself.
Abstract: Generating novel, yet realistic, images of persons is a challenging task due to the complex interplay between the different image factors, such as the foreground, background and pose information. In this work, we aim at generating such images based on a novel, two-stage reconstruction pipeline that learns a disentangled representation of the aforementioned image factors and generates novel person images at the same time. First, a multi-branched reconstruction network is proposed to disentangle and encode the three factors into embedding features, which are then combined to re-compose the input image itself. Second, three corresponding mapping functions are learned in an adversarial manner in order to map Gaussian noise to the learned embedding feature space, for each factor respectively. Using the proposed framework, we can manipulate the foreground, background and pose of the input image, and also sample new embedding features to generate such targeted manipulations, that provide more control over the generation process. Experiments on Market-1501 and Deepfashion datasets show that our model does not only generate realistic person images with new foregrounds, backgrounds and poses, but also manipulates the generated factors and interpolates the in-between states. Another set of experiments on Market-1501 shows that our model can also be beneficial for the person re-identification task.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed transfer learning approach for fault diagnosis with neural networks can improve the classification accuracy and reduce the training time comparing with the conventional neural network method when there are only a small amount of target data.
Abstract: Traditional machine learning algorithms have made great achievements in data-driven fault diagnosis. However, they assume that all the data must be in the same working condition and have the same distribution and feature space. They are not applicable for real-world working conditions, which often change with time, so the data are hard to obtain. In order to utilize data in different working conditions to improve the performance, this paper presents a transfer learning approach for fault diagnosis with neural networks. First, it learns characteristics from massive source data and adjusts the parameters of neural networks accordingly. Second, the structure of neural networks alters for the change of data distribution. In the same time, some parameters are transferred from source task to target task. Finally, the new model is trained by a small amount of target data in another working condition. The Case Western Reserve University bearing data set is used to validate the performance of the proposed transfer learning approach. Experimental results show that the proposed transfer learning approach can improve the classification accuracy and reduce the training time comparing with the conventional neural network method when there are only a small amount of target data.

Journal ArticleDOI
TL;DR: An elaborately designed deep hierarchical network, namely a contextual region-based convolutional neural network with multilayer fusion, for SAR ship detection, which is composed of a region proposal network (RPN) with high network resolution and an object detection network with contextual features.
Abstract: Synthetic aperture radar (SAR) ship detection has been playing an increasingly essential role in marine monitoring in recent years. The lack of detailed information about ships in wide swath SAR imagery poses difficulty for traditional methods in exploring effective features for ship discrimination. Being capable of feature representation, deep neural networks have achieved dramatic progress in object detection recently. However, most of them suffer from the missing detection of small-sized targets, which means that few of them are able to be employed directly in SAR ship detection tasks. This paper discloses an elaborately designed deep hierarchical network, namely a contextual region-based convolutional neural network with multilayer fusion, for SAR ship detection, which is composed of a region proposal network (RPN) with high network resolution and an object detection network with contextual features. Instead of using low-resolution feature maps from a single layer for proposal generation in a RPN, the proposed method employs an intermediate layer combined with a downscaled shallow layer and an up-sampled deep layer to produce region proposals. In the object detection network, the region proposals are projected onto multiple layers with region of interest (ROI) pooling to extract the corresponding ROI features and contextual features around the ROI. After normalization and rescaling, they are subsequently concatenated into an integrated feature vector for final outputs. The proposed framework fuses the deep semantic and shallow high-resolution features, improving the detection performance for small-sized ships. The additional contextual features provide complementary information for classification and help to rule out false alarms. Experiments based on the Sentinel-1 dataset, which contains twenty-seven SAR images with 7986 labeled ships, verify that the proposed method achieves an excellent performance in SAR ship detection.

Journal ArticleDOI
TL;DR: A discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged.
Abstract: This paper presents a new discriminative deep metric learning (DDML) method for face and kinship verification in wild conditions. While metric learning has achieved reasonably good performance in face and kinship verification, most existing metric learning methods aim to learn a single Mahalanobis distance metric to maximize the inter-class variations and minimize the intra-class variations, which cannot capture the nonlinear manifold where face images usually lie on. To address this, we propose a DDML method to train a deep neural network to learn a set of hierarchical nonlinear transformations to project face pairs into the same latent feature space, under which the distance of each positive pair is reduced and that of each negative pair is enlarged. To better use the commonality of multiple feature descriptors to make all the features more robust for face and kinship verification, we develop a discriminative deep multi-metric learning method to jointly learn multiple neural networks, under which the correlation of different features of each sample is maximized, and the distance of each positive pair is reduced and that of each negative pair is enlarged. Extensive experimental results show that our proposed methods achieve the acceptable results in both face and kinship verification.