Showing papers on "Feature (machine learning)" published in 2020


Journal ArticleDOI
TL;DR: An explanation method for trees is presented that enables the computation of optimal local explanations for individual predictions, and the authors demonstrate their method on three medical datasets.
Abstract: Tree-based machine learning models such as random forests, decision trees and gradient boosted trees are popular nonlinear predictive models, yet comparatively little attention has been paid to explaining their predictions. Here we improve the interpretability of tree-based models through three main contributions. (1) A polynomial time algorithm to compute optimal explanations based on game theory. (2) A new type of explanation that directly measures local feature interaction effects. (3) A new set of tools for understanding global model structure based on combining many local explanations of each prediction. We apply these tools to three medical machine learning problems and show how combining many high-quality local explanations allows us to represent global structure while retaining local faithfulness to the original model. These tools enable us to (1) identify high-magnitude but low-frequency nonlinear mortality risk factors in the US population, (2) highlight distinct population subgroups with shared risk characteristics, (3) identify nonlinear interaction effects among risk factors for chronic kidney disease and (4) monitor a machine learning model deployed in a hospital by identifying which features are degrading the model’s performance over time. Given the popularity of tree-based machine learning models, these improvements to their interpretability have implications across a broad set of domains. Tree-based machine learning models are widely used in domains such as healthcare, finance and public services. The authors present an explanation method for trees that enables the computation of optimal local explanations for individual predictions, and demonstrate their method on three medical datasets.

2,548 citations
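
The game-theoretic explanations mentioned in the abstract above are Shapley values. As a rough illustration only, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets. This costs exponential time and is not the polynomial-time tree algorithm the paper contributes. The function `toy_model`, the all-zero baseline, and the convention of replacing absent features with baseline values are assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature subsets (exponential cost)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest fall back to the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight shared by every coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    # Hypothetical stand-in for a tree ensemble: an additive term plus an interaction.
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

In this toy case the additive feature receives its full contribution, the two interacting features split their joint effect equally, and the values sum to `toy_model(x) - toy_model(baseline)`.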


Journal ArticleDOI
TL;DR: This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
Abstract: Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn .

560 citations


Journal ArticleDOI
TL;DR: To perform parameter optimization and feature selection simultaneously for SVM, an improved whale optimization algorithm (CMWOA) that combines chaotic and multi-swarm strategies is proposed; it significantly outperformed all the other competitors in terms of classification performance and feature subset size.

362 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes a novel Adaptive Curriculum Learning loss (CurricularFace) that embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage.
Abstract: As an emerging topic in face recognition, designing margin-based loss functions can increase the feature margin between different classes for enhanced discriminability. More recently, the idea of mining-based strategies has been adopted to emphasize the misclassified samples, achieving promising results. However, during the entire training process, the prior methods either do not explicitly emphasize samples based on their importance, which leaves the hard samples not fully exploited, or explicitly emphasize the effects of semi-hard/hard samples even at the early training stage, which may lead to convergence issues. In this work, we propose a novel Adaptive Curriculum Learning loss (CurricularFace) that embeds the idea of curriculum learning into the loss function to achieve a novel training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, our CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages. In each stage, different samples are assigned different importance according to their corresponding difficulty. Extensive experimental results on popular benchmarks demonstrate the superiority of our CurricularFace over the state-of-the-art competitors.

350 citations


Proceedings Article
02 Oct 2020
TL;DR: It is argued that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected, and hard negative mixing strategies at the feature level are proposed that can be computed on-the-fly with a minimal computational overhead.
Abstract: Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images apart, one can train highly transferable visual representations. As revealed by recent studies, heavy data augmentation and large sets of negatives are both crucial in learning such representations. At the same time, data mixing strategies either at the image or the feature level improve both supervised and semi-supervised learning by synthesizing novel examples, forcing networks to learn more robust features. In this paper, we argue that an important aspect of contrastive learning, i.e., the effect of hard negatives, has so far been neglected. To get more meaningful negative samples, current top contrastive self-supervised learning approaches either substantially increase the batch sizes, or keep very large memory banks; increasing the memory size, however, leads to diminishing returns in terms of performance. We therefore start by delving deeper into a top-performing framework and show evidence that harder negatives are needed to facilitate better and faster learning. Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead. We exhaustively ablate our approach on linear classification, object detection and instance segmentation and show that employing our hard negative mixing procedure improves the quality of visual representations learned by a state-of-the-art self-supervised learning method.

311 citations
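
The feature-level mixing idea in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "hard" negatives are taken to be those with the highest dot-product similarity to the query, synthetic negatives are convex combinations of random pairs of them, and the names `mix_hard_negatives`, `top_k` and `num_synthetic` are hypothetical.

```python
import math
import random


def l2_normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix_hard_negatives(query, negatives, top_k, num_synthetic, rng):
    """Synthesize extra negatives by mixing the hardest existing ones in feature space."""
    # "Hard" negatives are taken here to be those most similar to the query.
    hardest = sorted(negatives, key=lambda n: dot(query, n), reverse=True)[:top_k]
    synthetic = []
    for _ in range(num_synthetic):
        a, b = rng.sample(hardest, 2)
        alpha = rng.random()  # mixing coefficient in [0, 1)
        mixed = [alpha * ai + (1.0 - alpha) * bi for ai, bi in zip(a, b)]
        # Re-normalize so the synthetic point lies on the unit sphere like the others.
        synthetic.append(l2_normalize(mixed))
    return synthetic


rng = random.Random(0)
dim = 8
query = l2_normalize([rng.gauss(0, 1) for _ in range(dim)])
negatives = [l2_normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(32)]
synthetic = mix_hard_negatives(query, negatives, top_k=8, num_synthetic=4, rng=rng)
```

Because the mixing operates on embeddings that have already been computed, it adds only a sort and a few vector operations per query, which is in line with the small overhead the abstract describes.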


Proceedings ArticleDOI
01 Feb 2020
TL;DR: Convolutional Neural Networks follow a definite sequence of steps, including methods like Backpropagation, Convolutional Layers, Feature formation and Pooling.
Abstract: Before Convolutional Neural Networks gained popularity, computer recognition problems involved extracting features from the provided data, an approach that was neither adequately efficient nor highly accurate. In recent times, however, Convolutional Neural Networks have sought to provide a higher level of efficiency and accuracy in all the fields in which they have been employed, the most popular of which are Object Detection and Digit and Image Recognition. A CNN follows a definite sequence of steps, including methods like Backpropagation, Convolutional Layers, Feature formation and Pooling. This article also ventures into the use of various frameworks and tools that involve CNN models.

283 citations


Posted Content
TL;DR: In this article, representation self-challenging (RSC) is proposed to improve cross-domain generalization of CNNs by iteratively disabling the dominant features on the training data and forcing the network to activate remaining features that correlate with labels.
Abstract: Convolutional Neural Networks (CNN) conduct image classification by activating dominant features that correlate with labels. When the training and testing data are under similar distributions, their dominant features are similar, which usually facilitates decent performance on the testing data. Performance nonetheless falls short when the network is tested on samples from different distributions, leading to challenges in cross-domain image classification. We introduce a simple training heuristic, Representation Self-Challenging (RSC), that significantly improves the generalization of CNN to the out-of-domain data. RSC iteratively challenges (discards) the dominant features activated on the training data, and forces the network to activate remaining features that correlate with labels. This process appears to activate feature representations applicable to out-of-domain data without prior knowledge of the new domain and without learning extra network parameters. We present theoretical properties and conditions of RSC for improving cross-domain generalization. The experiments endorse the simple, effective and architecture-agnostic nature of our RSC method.

272 citations
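
The "challenge" step described in the abstract above, discarding the dominant features so that the network must rely on the remaining ones, can be sketched as a simple masking operation. This is a sketch under assumptions rather than the paper's procedure: dominance is measured here by a per-feature gradient value supplied by the caller, and `challenge_dominant_features` and `drop_fraction` are hypothetical names.

```python
def challenge_dominant_features(features, gradients, drop_fraction):
    """Zero out the features with the largest gradients, keeping the rest unchanged."""
    num_drop = int(len(features) * drop_fraction)
    # Rank feature indices by gradient value, largest first.
    ranked = sorted(range(len(features)), key=lambda i: gradients[i], reverse=True)
    dropped = set(ranked[:num_drop])
    masked = [0.0 if i in dropped else f for i, f in enumerate(features)]
    return masked, dropped


features = [0.9, 0.1, 0.5, 0.7, 0.2, 0.3]
gradients = [0.8, 0.05, 0.1, 0.6, 0.02, 0.3]  # hypothetical d(score)/d(feature)
masked, dropped = challenge_dominant_features(features, gradients, drop_fraction=0.5)
```

In a training loop, the masked features would be passed on to the classifier in place of the originals, so the loss is computed from the features that were not dominant.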


Proceedings Article
06 Dec 2020
TL;DR: SwAV uses a "swapped" prediction mechanism that predicts the cluster assignment of a view from the representation of another view, instead of comparing features directly as in contrastive learning.
Abstract: Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or "views") of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a "swapped" prediction mechanism where we predict the cluster assignment of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements much. We validate our findings by achieving 75.3% top-1 accuracy on ImageNet with ResNet-50, as well as surpassing supervised pretraining on all the considered transfer tasks.

261 citations


Journal ArticleDOI
TL;DR: The asymmetric and unsupervised FC-SAE can extract optimal non-linear features from environmental factors successfully, outperforms some conventional machine learning methods, and is promising for LSP.
Abstract: The environmental factors of landslide susceptibility are generally uncorrelated or non-linearly correlated, resulting in the limited prediction performances of conventional machine learning methods for landslide susceptibility prediction (LSP). Deep learning methods can exploit low-level features and high-level representations of information from environmental factors. In this paper, a novel deep learning–based algorithm, the fully connected sparse autoencoder (FC-SAE), is proposed for LSP. The FC-SAE consists of four steps: raw feature dropout in input layers, a sparse feature encoder in hidden layers, sparse feature extraction in output layers, and classification and prediction. The Sinan County of Guizhou Province in China, with a total of 23,195 landslide grid cells (306 recorded landslides) and 23,195 randomly selected non-landslide grid cells, was used as the study case. The frequency ratio values of 27 environmental factors were taken as the input variables of FC-SAE. All 46,390 landslide and non-landslide grid cells were randomly divided into a training dataset (70%) and a test dataset (30%). By analyzing real landslide/non-landslide data, the performances of the FC-SAE and two other conventional machine learning methods, support vector machine (SVM) and back-propagation neural network (BPNN), were compared. The results show that the prediction rate and total accuracies of the FC-SAE are 0.854 and 85.2%, which are higher than those of the SVM-only (0.827 and 81.56%) and BPNN (0.819 and 80.86%), respectively. In conclusion, the asymmetric and unsupervised FC-SAE can extract optimal non-linear features from environmental factors successfully, outperforms some conventional machine learning methods, and is promising for LSP.

233 citations


Journal ArticleDOI
TL;DR: The classification accuracy of the subject-independent (or calibration-free) model outperforms that of subject-dependent models using various methods [common spatial pattern (CSP), common spatiospectral pattern (CSSP), filter bank CSP, and Bayesian spatio-spectral filter optimization (BSSFO)].
Abstract: For a brain–computer interface (BCI) system, a calibration procedure is required for each individual user before he/she can use the BCI. This procedure requires approximately 20–30 min to collect enough data to build a reliable decoder. It is, therefore, an interesting topic to build a calibration-free, or subject-independent, BCI. In this article, we construct a large motor imagery (MI)-based electroencephalography (EEG) database and propose a subject-independent framework based on deep convolutional neural networks (CNNs). The database is composed of 54 subjects performing the left- and right-hand MI on two different days, resulting in 21 600 trials for the MI task. In our framework, we formulated the discriminative feature representation as a combination of the spectral–spatial input embedding the diversity of the EEG signals, as well as a feature representation learned from the CNN through a fusion technique that integrates a variety of discriminative brain signal patterns. To generate spectral–spatial inputs, we first consider the discriminative frequency bands in an information-theoretic observation model that measures the power of the features in two classes. From discriminative frequency bands, spectral–spatial inputs that include the unique characteristics of brain signal patterns are generated and then transformed into a covariance matrix as the input to the CNN. In the process of feature representations, spectral–spatial inputs are individually trained through the CNN and then combined by a concatenation fusion technique. In this article, we demonstrate that the classification accuracy of our subject-independent (or calibration-free) model outperforms that of subject-dependent models using various methods [common spatial pattern (CSP), common spatiospectral pattern (CSSP), filter bank CSP (FBCSP), and Bayesian spatio-spectral filter optimization (BSSFO)].

229 citations


Proceedings Article
30 Apr 2020
TL;DR: This paper argues, and provides empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators.
Abstract: Many recent methods for unsupervised or self-supervised representation learning train feature extractors by maximizing an estimate of the mutual information (MI) between different views of the data. This comes with several immediate problems: For example, MI is notoriously hard to estimate, and using it as an objective for representation learning may lead to highly entangled representations due to its invariance under arbitrary invertible transformations. Nevertheless, these methods have been repeatedly shown to excel in practice. In this paper we argue, and provide empirical evidence, that the success of these methods cannot be attributed to the properties of MI alone, and that they strongly depend on the inductive bias in both the choice of feature extractor architectures and the parametrization of the employed MI estimators. Finally, we establish a connection to deep metric learning and argue that this interpretation may be a plausible explanation for the success of the recently introduced methods.
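
To make "maximizing an estimate of the mutual information" concrete, the sketch below evaluates the InfoNCE lower bound, one widely used estimator, from a matrix of critic scores. Choosing InfoNCE here is an assumption for illustration, since the abstract does not name a specific estimator, and the score matrices are made-up values rather than outputs of a trained model.

```python
import math


def info_nce_lower_bound(scores):
    """InfoNCE estimate from an N x N critic matrix; scores[i][i] are the positive pairs."""
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over one positive and N - 1 negatives.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += row[i] - log_sum_exp
    return math.log(n) + total / n


# Hypothetical critic outputs for 3 paired views: a large diagonal means a confident critic.
confident = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
uninformative = [[0.0, 0.0, 0.0] for _ in range(3)]
estimate_confident = info_nce_lower_bound(confident)
estimate_uninformative = info_nce_lower_bound(uninformative)
```

Whatever the critic outputs, this estimate cannot exceed log N for a batch of N pairs, so its value depends heavily on the batch size and on how the critic is parametrized, not only on the true mutual information.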

Journal ArticleDOI
08 Jun 2020
TL;DR: A bioinspired data fusion architecture that can perform human gesture recognition by integrating visual data with somatosensory data from skin-like stretchable strain sensors made from single-walled carbon nanotubes is reported.
Abstract: Gesture recognition using machine-learning methods is valuable in the development of advanced cybernetics, robotics and healthcare systems, and typically relies on images or videos. To improve recognition accuracy, such visual data can be combined with data from other sensors, but this approach, which is termed data fusion, is limited by the quality of the sensor data and the incompatibility of the datasets. Here, we report a bioinspired data fusion architecture that can perform human gesture recognition by integrating visual data with somatosensory data from skin-like stretchable strain sensors made from single-walled carbon nanotubes. The learning architecture uses a convolutional neural network for visual processing and then implements a sparse neural network for sensor data fusion and recognition at the feature level. Our approach can achieve a recognition accuracy of 100% and maintain recognition accuracy in non-ideal conditions where images are noisy and under- or over-exposed. We also show that our architecture can be used for robot navigation via hand gestures, with an error of 1.7% under normal illumination and 3.3% in the dark. A bioinspired machine-learning architecture can combine visual data with data from stretchable strain sensors to achieve human gesture recognition with high accuracy in complex environments.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work introduces Action Genome, a representation that decomposes actions into spatio-temporal scene graphs and demonstrates the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples.
Abstract: Action recognition has typically treated actions and activities as monolithic events that occur in videos. However, there is evidence from Cognitive Science and Neuroscience that people actively encode activities into consistent hierarchical part structures. However, in Computer Vision, few explorations on representations that encode event partonomies have been made. Inspired by evidence that the prototypical unit of an event is an action-object interaction, we introduce Action Genome, a representation that decomposes actions into spatio-temporal scene graphs. Action Genome captures changes between objects and their pairwise relationships while an action occurs. It contains 10K videos with 0.4M objects and 1.7M visual relationships annotated. With Action Genome, we extend an existing action recognition model by incorporating scene graphs as spatio-temporal feature banks to achieve better performance on the Charades dataset. Next, by decomposing and learning the temporal changes in visual relationships that result in an action, we demonstrate the utility of a hierarchical event decomposition by enabling few-shot action recognition, achieving 42.7% mAP using as few as 10 examples. Finally, we benchmark existing scene graph models on the new task of spatio-temporal scene graph prediction.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: In this paper, a generic Temporal Pyramid Network (TPN) is proposed to capture action instances at various tempos, which can be flexibly integrated into 2D or 3D backbone networks in a plug and play manner.
Abstract: Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: Two learning methods are proposed that are easy to use and outperform existing deterministic methods as well as PFE on challenging unconstrained scenarios, and an analysis shows how incorporating uncertainty estimation helps reduce the adverse effects of noisy samples and affects feature learning.
Abstract: Modeling data uncertainty is important for noisy images, but seldom explored for face recognition. The pioneering work, PFE, considers uncertainty by modeling each face image embedding as a Gaussian distribution. It is quite effective. However, it uses a fixed feature (mean of the Gaussian) from an existing model. It only estimates the variance and relies on an ad-hoc and costly metric. Thus, it is not easy to use. It is unclear how uncertainty affects feature learning. This work applies data uncertainty learning to face recognition, such that the feature (mean) and uncertainty (variance) are learnt simultaneously, for the first time. Two learning methods are proposed. They are easy to use and outperform existing deterministic methods as well as PFE on challenging unconstrained scenarios. We also provide insightful analysis on how incorporating uncertainty estimation helps reduce the adverse effects of noisy samples and affects the feature learning.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: In this paper, a semantics-guided neural network (SGN) is proposed for skeleton-based action recognition, which explicitly introduces the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability.
Abstract: Skeleton-based human action recognition has attracted great interest thanks to the easy accessibility of the human skeleton data. Recently, there has been a trend of using very deep feedforward neural networks to model the 3D coordinates of joints without considering the computational efficiency. In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition. We explicitly introduce the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability. In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the dependencies of frames by taking the joints in the same frame as a whole. A strong baseline is proposed to facilitate the study of this field. With an order of magnitude smaller model size than most previous works, SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.

Posted Content
TL;DR: Through a series of analyses on transferring to block-shuffled images, the effect of feature reuse from learning low-level statistics of data is separated and it is shown that some benefit of transfer learning comes from the latter.
Abstract: One desired capability for machines is the ability to transfer their knowledge of one domain to another where data is (usually) scarce. Despite ample adaptation of transfer learning in various deep learning applications, we do not yet understand what enables a successful transfer and which part of the network is responsible for that. In this paper, we provide new tools and analyses to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from learning low-level statistics of data and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape and different instances of such a model are similar in feature space and close in parameter space.

Posted Content
TL;DR: This work proposes to train the first part of the circuit with the objective of maximally separating data classes in Hilbert space, a strategy it calls quantum metric learning, which provides a powerful analytic framework for quantum machine learning.
Abstract: Quantum classifiers are trainable quantum circuits used as machine learning models. The first part of the circuit implements a quantum feature map that encodes classical inputs into quantum states, embedding the data in a high-dimensional Hilbert space; the second part of the circuit executes a quantum measurement interpreted as the output of the model. Usually, the measurement is trained to distinguish quantum-embedded data. We propose to instead train the first part of the circuit (the embedding) with the objective of maximally separating data classes in Hilbert space, a strategy we call quantum metric learning. As a result, the measurement minimizing a linear classification loss is already known and depends on the metric used: for embeddings separating data using the l1 or trace distance, this is the Helstrom measurement, while for the l2 or Hilbert-Schmidt distance, it is a simple overlap measurement. This approach provides a powerful analytic framework for quantum machine learning and eliminates a major component in current models, freeing up more precious resources to best leverage the capabilities of near-term quantum information processors.

Journal ArticleDOI
TL;DR: Experimental results on the wheelset bearing dataset show that the proposed multiattention mechanism can significantly improve the discriminant feature representation, so that the MA1DCNN outperforms eight state-of-the-art networks.
Abstract: Recently, deep-learning-based fault diagnosis methods have been widely studied for rolling bearings. However, these neural networks lack interpretability for fault diagnosis tasks. That is, how to understand and learn discriminant fault features from complex monitoring signals remains a great challenge. Considering this challenge, this article explores the use of the attention mechanism in fault diagnosis networks and designs attention modules by fully considering characteristics of rolling bearing faults to enhance fault-related features and to ignore irrelevant features. Powered by the proposed attention mechanism, a multiattention one-dimensional convolutional neural network (MA1DCNN) is further proposed to diagnose wheelset bearing faults. The MA1DCNN can adaptively recalibrate features of each layer and can enhance the feature learning of fault impulses. Experimental results on the wheelset bearing dataset show that the proposed multiattention mechanism can significantly improve the discriminant feature representation, so that the MA1DCNN outperforms eight state-of-the-art networks.

Journal ArticleDOI
TL;DR: The experimental results show that the SS-BLS can achieve higher classification accuracy for different complex data, with fast operation speed and strong generalization ability.
Abstract: The broad learning system (BLS) is widely used in many fields because of its strong feature extraction ability and high computational efficiency. However, the BLS is mainly used in supervised learning, which greatly limits its applicability, since the data obtained in practice typically contain few labeled samples but a large number of unlabeled ones. Therefore, the BLS is extended with semi-supervised learning based on the manifold regularization framework to propose a semi-supervised broad learning system (SS-BLS). First, features are extracted from labeled and unlabeled data by building feature nodes and enhancement nodes. Then the manifold regularization framework is used to construct the Laplacian matrix. Next, the feature nodes, enhancement nodes and Laplacian matrix are combined to construct the objective function, which is effectively solved by ridge regression in order to obtain the output coefficients. Finally, the validity of the SS-BLS is verified on three different complex datasets: G50C, MNIST, and NORB. The experimental results show that the SS-BLS can achieve higher classification accuracy for different complex data, with fast operation speed and strong generalization ability.

Proceedings ArticleDOI
19 Oct 2020
TL;DR: FEATHER, a computationally efficient algorithm to calculate a specific variant of characteristic functions defined on graph vertices where the probability weights of the characteristic function are defined as the transition probabilities of random walks is introduced.
Abstract: In this paper, we propose a flexible notion of characteristic functions defined on graph vertices to describe the distribution of vertex features at multiple scales. We introduce FEATHER, a computationally efficient algorithm to calculate a specific variant of these characteristic functions where the probability weights of the characteristic function are defined as the transition probabilities of random walks. We argue that features extracted by this procedure are useful for node level machine learning tasks. We discuss the pooling of these node representations, resulting in compact descriptors of graphs that can serve as features for graph classification algorithms. We analytically prove that FEATHER describes isomorphic graphs with the same representation and exhibits robustness to data corruption. Using the node feature characteristic functions we define parametric models where evaluation points of the functions are learned parameters of supervised classifiers. Experiments on real world large datasets show that our proposed algorithm creates high quality representations, performs transfer learning efficiently, exhibits robustness to hyperparameter changes and scales linearly with the input size.
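
A rough sketch of the quantity described in the abstract above: for each node, the real and imaginary parts of a characteristic function of a vertex feature, weighted by one-step random-walk transition probabilities and evaluated at a few points. This is an illustration under assumptions, not the FEATHER implementation: it uses a single scalar feature, a single walk step, a graph with no isolated nodes, and hypothetical names (`transition_matrix`, `characteristic_features`, `thetas`).

```python
import math


def transition_matrix(adjacency):
    """Row-normalize an adjacency matrix into one-step random-walk probabilities."""
    return [[a / sum(row) for a in row] for row in adjacency]


def characteristic_features(adjacency, node_feature, thetas):
    """Per-node real and imaginary parts of a random-walk-weighted characteristic function."""
    probs = transition_matrix(adjacency)
    descriptors = []
    for row in probs:
        # E[cos(theta * x)] and E[sin(theta * x)] under the walk's next-node distribution.
        real = [sum(p * math.cos(t * x) for p, x in zip(row, node_feature)) for t in thetas]
        imag = [sum(p * math.sin(t * x) for p, x in zip(row, node_feature)) for t in thetas]
        descriptors.append(real + imag)
    return descriptors


# Hypothetical 4-node path graph 0-1-2-3 with one scalar feature per node.
adjacency = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
node_feature = [0.0, 1.0, 1.0, 0.0]
thetas = [0.0, 0.5, 1.0]
descriptors = characteristic_features(adjacency, node_feature, thetas)
```

Nodes 0 and 3 occupy mirror-image positions in this path graph and receive identical descriptors, a small-scale instance of structurally equivalent inputs mapping to the same representation. Averaging the rows of `descriptors` would give one simple pooled descriptor for the whole graph.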

Proceedings ArticleDOI
01 Feb 2020
TL;DR: This paper proposes a holistic deep learning-based activity recognition architecture, a convolutional neural network-long short-term memory network (CNN-LSTM), which improves the predictive accuracy of human activities from raw data but also reduces the complexity of the model while eliminating the need for advanced feature engineering.
Abstract: To understand human behavior and intrinsically anticipate human intentions, research into human activity recognition (HAR) using sensors in wearable and handheld devices has intensified. The ability of a system to use as few resources as possible to recognize a user's activity from raw data is what many researchers are striving for. In this paper, we propose a holistic deep learning-based activity recognition architecture, a convolutional neural network-long short-term memory network (CNN-LSTM). This CNN-LSTM approach not only improves the predictive accuracy of human activities from raw data but also reduces the complexity of the model while eliminating the need for advanced feature engineering. The CNN-LSTM network is both spatially and temporally deep. Our proposed model achieves 99% accuracy on the iSPL dataset, an internal dataset, and 92% accuracy on the UCI HAR public dataset. We also compared its performance against other approaches. It competes favorably against other deep neural network (DNN) architectures that have been proposed in the past and against machine learning models that rely on manually engineered feature datasets.

Posted Content
TL;DR: A generic Temporal Pyramid Network (TPN) at the feature-level is proposed, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner and shows consistent improvements over other challenging baselines on several action recognition datasets.
Abstract: Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two essential components of TPN, the source of features and the fusion of features, form a feature hierarchy for the backbone so that it can capture action instances at various tempos. TPN also shows consistent improvements over other challenging baselines on several action recognition datasets. Specifically, when equipped with TPN, the 3D ResNet-50 with dense sampling obtains a 2% gain on the validation set of Kinetics-400. A further analysis also reveals that TPN gains most of its improvements on action classes that have large variances in their visual tempos, validating the effectiveness of TPN.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes an attribute embedding technique that aligns each attribute-based feature with its attribute semantic vector, and computes a vector of attribute scores, for the presence of each attribute in an image, whose similarity with the true class semantic vector is maximized.
Abstract: We address the problem of fine-grained generalized zero-shot recognition of visually similar classes without training images for some classes. We propose a dense attribute-based attention mechanism that for each attribute focuses on the most relevant image regions, obtaining attribute-based features. Instead of aligning a global feature vector of an image with its associated class semantic vector, we propose an attribute embedding technique that aligns each attribute-based feature with its attribute semantic vector. Hence, we compute a vector of attribute scores, for the presence of each attribute in an image, whose similarity with the true class semantic vector is maximized. Moreover, we adjust each attribute score using an attention mechanism over attributes to better capture the discriminative power of different attributes. To tackle the challenge of bias towards seen classes during testing, we propose a new self-calibration loss that adjusts the probability of unseen classes to account for the training bias. We conduct experiments on three popular datasets of CUB, SUN and AWA2 as well as the large-scale DeepFashion dataset, showing that our model significantly improves the state of the art.

Journal ArticleDOI
TL;DR: Two novel lightweight networks are proposed that can obtain higher recognition precision while using fewer trainable parameters and can be useful when deploying deep convolutional neural networks (CNNs) on mobile embedded devices.
Abstract: Deeper neural networks have achieved great results in the field of computer vision and have been successfully applied to tasks such as traffic sign recognition. However, as traffic sign recognition systems are often deployed in resource-constrained environments, it is critical for the network design to be slim and accurate in these instances. Accordingly, in this paper, we propose two novel lightweight networks that can obtain higher recognition precision while using fewer trainable parameters. Knowledge distillation transfers the knowledge in a trained model, called the teacher network, to a smaller model, called the student network. Moreover, to improve the accuracy of traffic sign recognition, we also implement a new module in our teacher network that combines two streams of feature channels with dense connectivity. To enable easy deployment on mobile devices, our student network is a simple end-to-end architecture containing five convolutional layers and a fully connected layer. Furthermore, by treating channels whose batch normalization (BN) scaling factors are close to zero as insignificant, we prune redundant channels from the student network, yielding a compact model with accuracy comparable to that of more complex models. Our teacher network exhibited an accuracy rate of 93.16% when trained and tested on the CIFAR-10 general dataset. Using the knowledge of our teacher network, we train the student network on the GTSRB and BTSC traffic sign datasets. Our student model uses only 0.8 million parameters while still achieving accuracies of 99.61% and 99.13% on the GTSRB and BTSC datasets, respectively. All experimental results show that our lightweight networks can be useful when deploying deep convolutional neural networks (CNNs) on mobile embedded devices.
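
The channel-pruning step mentioned in this abstract (dropping channels whose BN scaling factors are near zero) might look roughly like the sketch below. It is an illustration only: the function names, the fixed `prune_ratio` criterion, and the weight layout `(out_channels, in_channels, k, k)` are assumptions, and the abstract does not say how the threshold is actually chosen.

```python
import numpy as np

def channels_to_keep(bn_gamma, prune_ratio):
    """Indices of channels that survive pruning, ranked by |gamma|."""
    magnitude = np.abs(np.asarray(bn_gamma, dtype=float))
    n_prune = int(len(magnitude) * prune_ratio)
    order = np.argsort(magnitude)          # smallest scaling factors first
    return np.sort(order[n_prune:])        # keep the rest, in original order

def prune_conv(weight, bn_gamma, prune_ratio):
    """Drop output channels of a conv weight whose BN scale is smallest.

    weight is assumed to have shape (out_channels, in_channels, k, k).
    """
    keep = channels_to_keep(bn_gamma, prune_ratio)
    return weight[keep], keep
```

In a full network the matching input channels of the following layer, and the BN parameters themselves, would have to be sliced with the same `keep` indices; that bookkeeping is omitted here.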

Posted Content
TL;DR: It is shown that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio, and this self-supervised learning approach achieves highly competitive performance when finetuned on action recognition tasks.
Abstract: We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice versa. We show that optimizing for cross-modal discrimination, rather than within-modal discrimination, is important to learn good representations from video and audio. With this simple but powerful insight, our method achieves state-of-the-art results when finetuned on action recognition tasks. While recent work in contrastive learning defines positive and negative samples as individual instances, we generalize this definition by exploring cross-modal agreement. We group together multiple instances as positives by measuring their similarity in both the video and the audio feature spaces. Cross-modal agreement creates better positive and negative sets, and allows us to calibrate visual similarities by seeking within-modal discrimination of positive instances.
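
To make "cross-modal discrimination" concrete, the sketch below computes a generic in-batch contrastive loss in which each video embedding must pick out its own audio embedding, and vice versa. This is an InfoNCE-style stand-in chosen for illustration; the function name, the temperature value, and the in-batch negatives are assumptions and may differ from the objective used in the paper.

```python
import numpy as np

def cross_modal_nce(video, audio, temperature=0.07):
    """Symmetric contrastive loss between paired video and audio embeddings.

    video, audio : (n, d) arrays; row i of each comes from the same clip.
    """
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    logits = v @ a.T / temperature             # (n, n) cross-modal similarities

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)   # stabilise the softmax
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))     # the true pair sits on the diagonal

    # video -> audio and audio -> video directions, averaged
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
```

Note that no video-to-video or audio-to-audio similarity enters the loss, which is the sense in which the objective is cross-modal rather than within-modal.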

Proceedings ArticleDOI
14 Jun 2020
TL;DR: This work proposes a novel system for unsupervised skeleton-based action recognition based on an encoder-decoder recurrent neural network, where the encoder learns a separable feature representation within its hidden states formed by training the model to perform the prediction task.
Abstract: We propose a novel system for unsupervised skeleton-based action recognition. Given inputs of body-keypoints sequences obtained during various movements, our system associates the sequences with actions. Our system is based on an encoder-decoder recurrent neural network, where the encoder learns a separable feature representation within its hidden states formed by training the model to perform the prediction task. We show that under such unsupervised training, the decoder and the encoder self-organize their hidden states into a feature space in which similar movements fall into the same cluster and distinct movements into distant clusters. Current state-of-the-art methods for action recognition are strongly supervised, i.e., rely on providing labels for training. Unsupervised methods have been proposed; however, they require camera and depth inputs (RGB+D) at each time step. In contrast, our system is fully unsupervised, does not require action labels at any stage and can operate with body-keypoints input only. Furthermore, the method can perform on various dimensions of body-keypoints (2D or 3D) and can include additional cues describing movements. We evaluate our system on three action recognition benchmarks with different numbers of actions and examples. Our results outperform prior unsupervised skeleton-based methods and unsupervised RGB+D based methods on cross-view tests and, while being unsupervised, are comparable to supervised skeleton-based action recognition.

Journal ArticleDOI
03 Apr 2020
TL;DR: This paper designs a random group permutation method combined with multi-layer convolutional networks to learn the low-dimensional features from multivariate time series data and proposes a novel MTSC model with an attentional prototype network to take the strengths of both traditional and deep learning based approaches.
Abstract: With the advance of sensor technologies, the Multivariate Time Series classification (MTSC) problem, perhaps one of the most essential problems in the time series data mining domain, has continuously received a significant amount of attention in recent decades. Traditional time series classification approaches based on Bag-of-Patterns or Time Series Shapelet have difficulty dealing with the huge amounts of feature candidates generated in high-dimensional multivariate data but have promising performance even when the training set is small. In contrast, deep learning based methods can learn low-dimensional features efficiently but suffer from a shortage of labelled data. In this paper, we propose a novel MTSC model with an attentional prototype network to take the strengths of both traditional and deep learning based approaches. Specifically, we design a random group permutation method combined with multi-layer convolutional networks to learn the low-dimensional features from multivariate time series data. To handle the issue of limited training labels, we propose a novel attentional prototype network to train the feature representation based on their distance to class prototypes with inadequate data labels. In addition, we extend our model into its semi-supervised setting by utilizing the unlabeled data. Extensive experiments on 18 datasets in a public UEA Multivariate time series archive with eight state-of-the-art baseline methods exhibit the effectiveness of the proposed model.

Journal ArticleDOI
TL;DR: A new fault diagnosis method is presented, which generalizes convolutional neural network (CNN) to TL scenario and gets the best performance for fault classification.
Abstract: Fault diagnosis is very important for condition-based maintenance. Recently, deep learning models have been introduced to learn hierarchical representations from raw data instead of using hand-crafted features, and they exhibit excellent performance. The success of current deep learning relies on two conditions: 1) the training (source domain) and testing (target domain) datasets are from the same feature distribution; 2) enough labeled data with fault information exist. However, because the machine operates under a non-stationary working condition, the trained model built on the source domain cannot be directly applied to the target domain. Moreover, since sufficient labeled or even unlabeled data are not available in the target domain, collecting the labeled data and building the model from scratch is time-consuming and expensive. Motivated by transfer learning (TL), we present a new fault diagnosis method, which generalizes the convolutional neural network (CNN) to the TL scenario. Two layers with regard to task-specific features are adapted in a layer-wise way to regularize the parameters of the CNN. In addition, the domain loss is calculated by a linear combination of multiple Gaussian kernels so that the ability of adaptation is enhanced compared to a single kernel. Through these two means, the distribution discrepancy is reduced and the transferable features are learned. The proposed method is validated by transfer fault diagnosis experiments. Compared to a CNN without domain adaptation and shallow transfer learning methods, the proposed method gets the best performance for fault classification.
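
A domain loss built from "a linear combination of multiple Gaussian kernels" is commonly realised as a multi-kernel maximum mean discrepancy (MMD). The sketch below shows one such estimator on source and target feature batches. It is a plausible reading of the abstract rather than the paper's exact loss: the function name, the bandwidths in `sigmas`, the uniform default weights, and the biased estimator are all assumptions.

```python
import numpy as np

def mk_mmd2(Xs, Xt, sigmas=(0.5, 1.0, 2.0), weights=None):
    """Biased estimate of squared MMD with a weighted sum of Gaussian kernels.

    Xs : (n, d) source-domain features, Xt : (m, d) target-domain features.
    """
    if weights is None:
        weights = np.full(len(sigmas), 1.0 / len(sigmas))   # uniform combination

    def kernel(A, B):
        # pairwise squared Euclidean distances, shape (len(A), len(B))
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return sum(w * np.exp(-d2 / (2.0 * s ** 2)) for w, s in zip(weights, sigmas))

    return kernel(Xs, Xs).mean() + kernel(Xt, Xt).mean() - 2.0 * kernel(Xs, Xt).mean()
```

During training, a term like this would typically be added to the classification loss for each adapted layer, so that minimising it pulls the source and target feature distributions together.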

Journal ArticleDOI
TL;DR: Experimental results, conducted using three large-scale benchmark data sets, demonstrate that the newly proposed SCCov network exhibits very competitive or superior classification performance when compared with the current state-of-the-art RSSC techniques, using a much lower amount of parameters.
Abstract: This paper proposes a novel end-to-end learning model, called skip-connected covariance (SCCov) network, for remote sensing scene classification (RSSC). The innovative contribution of this paper is to embed two novel modules into the traditional convolutional neural network (CNN) model, i.e., skip connections and covariance pooling. The advantages of newly developed SCCov are twofold. First, by means of the skip connections, the multi-resolution feature maps produced by the CNN are combined together, which provides important benefits to address the presence of large-scale variance in RSSC data sets. Second, by using covariance pooling, we can fully exploit the second-order information contained in such multi-resolution feature maps. This allows the CNN to achieve more representative feature learning when dealing with RSSC problems. Experimental results, conducted using three large-scale benchmark data sets, demonstrate that our newly proposed SCCov network exhibits very competitive or superior classification performance when compared with the current state-of-the-art RSSC techniques, using a much lower amount of parameters. Specifically, our SCCov only needs 10% of the parameters used by its counterparts.
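
The two modules named in this abstract can be illustrated with a small sketch: feature maps from different layers are concatenated along the channel axis (standing in for the skip connections) and then summarised by their channel covariance matrix (second-order pooling). This is a simplified illustration, not the SCCov architecture itself: the function name is hypothetical, the maps are assumed to have already been resized to a common spatial size, and any matrix normalization applied after pooling is omitted.

```python
import numpy as np

def covariance_pool(feature_maps):
    """Second-order pooling over concatenated feature maps.

    feature_maps : list of arrays shaped (C_i, H, W) sharing the same H and W.
    Returns a (C, C) covariance matrix with C = sum of all C_i.
    """
    stacked = np.concatenate(feature_maps, axis=0)   # combine multi-layer features
    C = stacked.shape[0]
    X = stacked.reshape(C, -1)                       # each column is one spatial position
    X = X - X.mean(axis=1, keepdims=True)            # centre every channel
    return X @ X.T / X.shape[1]                      # channel-by-channel covariance
```

The output size depends only on the number of channels, not on H or W, and the symmetric matrix (or its upper triangle) can be passed to a classifier.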