scispace - formally typeset
Search or ask a question

Showing papers on "Dimensionality reduction published in 2019"


Posted Content
TL;DR: This paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain, and develops a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction.
Abstract: Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity. To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via $1D$ convolution. Furthermore, we develop a method to adaptively select kernel size of $1D$ convolution, determining coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.

1,048 citations


Journal ArticleDOI
TL;DR: A brief overview of text classification algorithms is discussed in this article, where different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods are discussed, and the limitations of each technique and their application in real-world problems are discussed.
Abstract: In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in real-world problems are discussed.

624 citations


Journal ArticleDOI
TL;DR: An overview of text classification algorithms is discussed, which covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods.
Abstract: In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in the real-world problem are discussed.

612 citations


Journal ArticleDOI
TL;DR: A protocol is introduced to help avoid common shortcomings of t-SNE, for example, enabling preservation of the global structure of the data.
Abstract: Single-cell transcriptomics yields ever growing data sets containing RNA expression levels for thousands of genes from up to millions of cells. Common data analysis pipelines include a dimensionality reduction step for visualising the data in two dimensions, most frequently performed using t-distributed stochastic neighbour embedding (t-SNE). It excels at revealing local structure in high-dimensional data, but naive applications often suffer from severe shortcomings, e.g. the global structure of the data is not represented accurately. Here we describe how to circumvent such pitfalls, and develop a protocol for creating more faithful t-SNE visualisations. It includes PCA initialisation, a high learning rate, and multi-scale similarity kernels; for very large data sets, we additionally use exaggeration and downsampling-based initialisation. We use published single-cell RNA-seq data sets to demonstrate that this protocol yields superior results compared to the naive application of t-SNE. t-SNE is widely used for dimensionality reduction and visualization of high-dimensional single-cell data. Here, the authors introduce a protocol to help avoid common shortcomings of t-SNE, for example, enabling preservation of the global structure of the data.

576 citations


Journal ArticleDOI
TL;DR: Simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance are proposed, which outperform the current practice in a downstream clustering assessment using ground truth datasets.
Abstract: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.

287 citations


Journal ArticleDOI
TL;DR: In this article, an automated toolkit for t-SNE parameter selection was developed that utilizes Kullback-Leibler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner.
Abstract: Accurate and comprehensive extraction of information from high-dimensional single cell datasets necessitates faithful visualizations to assess biological populations. A state-of-the-art algorithm for non-linear dimension reduction, t-SNE, requires multiple heuristics and fails to produce clear representations of datasets when millions of cells are projected. We develop opt-SNE, an automated toolkit for t-SNE parameter selection that utilizes Kullback-Leibler divergence evaluation in real time to tailor the early exaggeration and overall number of gradient descent iterations in a dataset-specific manner. The precise calibration of early exaggeration together with opt-SNE adjustment of gradient descent learning rate dramatically improves computation time and enables high-quality visualization of large cytometry and transcriptomics datasets, overcoming limitations of analysis tools with hard-coded parameters that often produce poorly resolved or misleading maps of fluorescent and mass cytometry data. In summary, opt-SNE enables superior data resolution in t-SNE space and thereby more accurate data interpretation. Visualisation tools that use dimensionality reduction, such as t-SNE, provide poor visualisation on large data sets of millions of observations. Here the authors present opt-SNE, that automatically finds data set-tailored parameters for t-SNE to optimise visualisation and improve analysis.

271 citations


Journal ArticleDOI
TL;DR: This paper proposes a generalized framework, named as transfer independently together (TIT), which learns multiple transformations, one for each domain (independently) to map data onto a shared latent space, where the domains are well aligned.
Abstract: Currently, unsupervised heterogeneous domain adaptation in a generalized setting, which is the most common scenario in real-world applications, is under insufficient exploration. Existing approaches either are limited to special cases or require labeled target samples for training. This paper aims to overcome these limitations by proposing a generalized framework, named as transfer independently together (TIT). Specifically, we learn multiple transformations, one for each domain (independently) , to map data onto a shared latent space, where the domains are well aligned. The multiple transformations are jointly optimized in a unified framework (together) by an effective formulation. In addition, to learn robust transformations, we further propose a novel landmark selection algorithm to reweight samples, i.e., increase the weight of pivot samples and decrease the weight of outliers. Our landmark selection is based on graph optimization. It focuses on sample geometric relationship rather than sample features. As a result, by abstracting feature vectors to graph vertices, only a simple and fast integer arithmetic is involved in our algorithm instead of matrix operations with float point arithmetic in existing approaches. At last, we effectively optimize our objective via a dimensionality reduction procedure. TIT is applicable to arbitrary sample dimensionality and does not need labeled target samples for training. Extensive evaluations on several standard benchmarks and large-scale datasets of image classification, text categorization and text-to-image recognition verify the superiority of our approach.

259 citations


Journal ArticleDOI
TL;DR: A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.

252 citations


Journal ArticleDOI
TL;DR: The proposed work, deploys filter and wrapper based method with firefly algorithm in the wrapper for selecting the features, and shows that 10 features are sufficient to detect the intrusion showing improved accuracy.

215 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed hybrid dimensionality reduction method with the ensemble of the base learners contributes more critical features and significantly outperforms individual approaches, achieving high accuracy and low false alarm rates.

200 citations


Journal ArticleDOI
TL;DR: This work proposes a hyperspectral image unsupervised classification framework based on robust manifold matrix factorization and its out-of-sample extension and designs a novel Augmented Lagrangian Method (ALM) based procedure to seek the local optimal solution of the proposed optimization.

Proceedings ArticleDOI
15 Jun 2019
TL;DR: NDDR-CNN as mentioned in this paper concatenates features with the same spatial resolution from different tasks according to their channel dimension and shows that the discriminative dimensionality reduction can be fulfilled by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN.
Abstract: In this paper, we propose a novel Convolutional Neural Network (CNN) structure for general-purpose multi-task learning (MTL), which enables automatic feature fusing at every layer from different tasks. This is in contrast with the most widely used MTL CNN structures which empirically or heuristically share features on some specific layers (e.g., share all the features except the last convolutional layer). The proposed layerwise feature fusing scheme is formulated by combining existing CNN components in a novel way, with clear mathematical interpretability as discriminative dimensionality reduction, which is referred to as Neural Discriminative Dimensionality Reduction (NDDR). Specifically, we first concatenate features with the same spatial resolution from different tasks according to their channel dimension. Then, we show that the discriminative dimensionality reduction can be fulfilled by 1x1 Convolution, Batch Normalization, and Weight Decay in one CNN. The use of existing CNN components ensures the end-to-end training and the extensibility of the proposed NDDR layer to various state-of-the-art CNN architectures in a "plug-and-play" manner. The detailed ablation analysis shows that the proposed NDDR layer is easy to train and also robust to different hyperparameters. Experiments on different task sets with various base network architectures demonstrate the promising performance and desirable generalizability of our proposed method. The code of our paper is available at https://github.com/ethanygao/NDDR-CNN.

Journal ArticleDOI
TL;DR: Experimental results show that the PCA method can effectively eliminate the feature correlation and realize the dimension reduction of the feature matrix, the BLS can take on better adaptability, faster computation speed, and higher classification accuracy, and the PABSFD method can efficiently and accurately obtain the fault diagnosis results.
Abstract: Traditional feature extraction methods are used to extract the features of signal to construct the fault feature matrix, which exists the complex structure, higher correlation, and redundancy. This will increase the complex fault classification and seriously affect the accuracy and efficiency of fault identification. In order to solve these problems, a new fault diagnosis (PABSFD) method based on the principal component analysis (PCA) and the broad learning system (BLS) is proposed for rotor system in this paper. In the proposed PABSFD method, the PCA with revealing the signal essence is used to reduce the dimension of the constructed feature matrix and decrease the linear feature correlation between data and eliminate the redundant attributes in order to obtain the low-dimensional feature matrix with retaining the essential features for the classification model. Then, the BLS with low time complexity and high classification accuracy is regarded as a classification model to realize the fault identification; it can efficiently accomplish the fault classification of rotor system. Finally, the actual vibration data of rotor system are selected to test and verify the effectiveness of the PABSFD method. The experimental results show that the PCA method can effectively eliminate the feature correlation and realize the dimension reduction of the feature matrix, the BLS can take on better adaptability, faster computation speed, and higher classification accuracy, and the PABSFD method can efficiently and accurately obtain the fault diagnosis results.

Journal ArticleDOI
TL;DR: A set of useful guidelines for practitioners specifying how to correctly perform dimensionality reduction, interpret its output, and communicate results are presented.
Abstract: Dimensionality reduction (DR) is frequently applied during the analysis of high-dimensional data. Both a means of denoising and simplification, it can be beneficial for the majority of modern biological datasets, in which it’s not uncommon to have hundreds or even millions of simultaneous measurements collected for a single sample. Because of “the curse of dimensionality,” many statistical methods lack power when applied to high-dimensional data. Even if the number of collected data points is large, they remain sparsely submerged in a voluminous high-dimensional space that is practically impossible to explore exhaustively (see chapter 12 [1]). By reducing the dimensionality of the data, you can often alleviate this challenging and troublesome phenomenon. Low-dimensional data representations that remove noise but retain the signal of interest can be instrumental in understanding hidden structures and patterns. Original high-dimensional data often contain measurements on uninformative or redundant variables. DR can be viewed as a method for latent feature extraction. It is also frequently used for data compression, exploration, and visualization. Although many DR techniques have been developed and implemented in standard data analytic pipelines, they are easy to misuse, and their results are often misinterpreted in practice. This article presents a set of useful guidelines for practitioners specifying how to correctly perform DR, interpret its output, and communicate results. Note that this is not a review article, and we recommend some important reviews in the references.

Journal ArticleDOI
TL;DR: A Multi-Class Combined performance metric is proposed to compare various multi-class and binary classification systems through incorporating FAR, DR, Accuracy, and class distribution parameters and a uniform distribution based balancing approach is developed to handle the imbalanced distribution of the minority class instances in the CICIDS2017 network intrusion dataset.
Abstract: The security of networked systems has become a critical universal issue that influences individuals, enterprises and governments. The rate of attacks against networked systems has increased dramatically, and the tactics used by the attackers are continuing to evolve. Intrusion detection is one of the solutions against these attacks. A common and effective approach for designing Intrusion Detection Systems (IDS) is Machine Learning. The performance of an IDS is significantly improved when the features are more discriminative and representative. This study uses two feature dimensionality reduction approaches: (i) Auto-Encoder (AE): an instance of deep learning, for dimensionality reduction, and (ii) Principle Component Analysis (PCA). The resulting low-dimensional features from both techniques are then used to build various classifiers such as Random Forest (RF), Bayesian Network, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) for designing an IDS. The experimental findings with low-dimensional features in binary and multi-class classification show better performance in terms of Detection Rate (DR), F-Measure, False Alarm Rate (FAR), and Accuracy. This research effort is able to reduce the CICIDS2017 dataset’s feature dimensions from 81 to 10, while maintaining a high accuracy of 99.6% in multi-class and binary classification. Furthermore, in this paper, we propose a Multi-Class Combined performance metric C o m b i n e d M c with respect to class distribution to compare various multi-class and binary classification systems through incorporating FAR, DR, Accuracy, and class distribution parameters. In addition, we developed a uniform distribution based balancing approach to handle the imbalanced distribution of the minority class instances in the CICIDS2017 network intrusion dataset.

Journal ArticleDOI
TL;DR: A RF fingerprint identification method based on dimensional reduction and machine learning is proposed as a component of intrusion detection for resolving authentication security issues and improves security protection due to the introduction of RF fingerprinting.
Abstract: The access security of wireless devices is a serious challenge in present wireless network security. Radio frequency (RF) fingerprint recognition technology as an important non-password authentication technology attracts more and more attention, because of its full use of radio frequency characteristics that cannot be imitated to achieve certification. In this paper, a RF fingerprint identification method based on dimensional reduction and machine learning is proposed as a component of intrusion detection for resolving authentication security issues. We compare three kinds of dimensional reduction methods, which are the traditional PCA, RPCA and KPCA. And we take random forests, support vector machine, artificial neural network and grey correlation analysis into consideration to make decisions on the dimensional reduction data. Finally, we obtain the recognition system with the best performance. Using a combination of RPCA and random forests, we achieve 90% classification accuracy is achieved at SNR $$\ge $$ 10 dB when reduced dimension is 76. The proposed method improves wireless device authentication and improves security protection due to the introduction of RF fingerprinting.

Posted Content
02 Apr 2019
TL;DR: A novel methodology combining the benefits of correlation-based feature selection and bat algorithm with an ensemble classifier based on C4.5, Random Forest, and Forest by Penalizing Attributes, which can be able to classify both common and rare types of attacks with high accuracy and efficiency.
Abstract: Intrusion detection system (IDS) is one of extensively used techniques in a network topology to safeguard the integrity and availability of sensitive assets in the protected systems. Although many supervised and unsupervised learning approaches from the field of machine learning have been used to increase the efficacy of IDSs, it is still a problem for existing intrusion detection algorithms to achieve good performance. First, lots of redundant and irrelevant data in high-dimensional datasets interfere with the classification process of an IDS. Second, an individual classifier may not perform well in the detection of each type of attacks. Third, many models are built for stale datasets, making them less adaptable for novel attacks. Thus, we propose a new intrusion detection framework in this paper, and this framework is based on the feature selection and ensemble learning techniques. In the first step, a heuristic algorithm called CFS-BA is proposed for dimensionality reduction, which selects the optimal subset based on the correlation between features. Then, we introduce an ensemble approach that combines C4.5, Random Forest (RF), and Forest by Penalizing Attributes (Forest PA) algorithms. Finally, voting technique is used to combine the probability distributions of the base learners for attack recognition. The experimental results, using NSL-KDD, AWID, and CIC-IDS2017 datasets, reveal that the proposed CFS-BA-Ensemble method is able to exhibit better performance than other related and state of the art approaches under several metrics.

Journal ArticleDOI
TL;DR: The experimental results indicate that the AC, FAR, and timeliness of the CNN–IDS model are higher than those of traditional algorithms, therefore, the model has not only research significance but also practical value.
Abstract: With the popularity and development of network technology and the Internet, intrusion detection systems (IDSs), which can identify attacks, have been developed. Traditional intrusion detection algorithms typically employ mining association rules to identify intrusion behaviors. However, they fail to fully extract the characteristic information of user behaviors and encounter various problems, such as high false alarm rate (FAR), poor generalization capability, and poor timeliness. In this paper, we propose a network intrusion detection model based on a convolutional neural network-IDS (CNN-IDS). Redundant and irrelevant features in the network traffic data are first removed using different dimensionality reduction methods. Features of the dimensionality reduction data are automatically extracted using the CNN, and more effective information for identifying intrusion is extracted by supervised learning. To reduce the computational cost, we convert the original traffic vector format into an image format and use a standard KDD-CUP99 dataset to evaluate the performance of the proposed CNN model. The experimental results indicate that the AC, FAR, and timeliness of the CNN-IDS model are higher than those of traditional algorithms. Therefore, the model we propose has not only research significance but also practical value.

Journal ArticleDOI
TL;DR: A novel computational method named Ensemble of Decision Tree based MiRNA-Disease Association prediction (EDTMDA) is proposed, which innovatively built a computational framework integrating ensemble learning and dimensionality reduction.
Abstract: In recent years, increasing associations between microRNAs (miRNAs) and human diseases have been identified. Based on accumulating biological data, many computational models for potential miRNA-disease associations inference have been developed, which saves time and expenditure on experimental studies, making great contributions to researching molecular mechanism of human diseases and developing new drugs for disease treatment. In this paper, we proposed a novel computational method named Ensemble of Decision Tree based MiRNA-Disease Association prediction (EDTMDA), which innovatively built a computational framework integrating ensemble learning and dimensionality reduction. For each miRNA-disease pair, the feature vector was extracted by calculating the statistical measures, graph theoretical measures, and matrix factorization results for the miRNA and disease, respectively. Then multiple base learnings were built to yield many decision trees (DTs) based on random selection of negative samples and miRNA/disease features. Particularly, Principal Components Analysis was applied to each base learning to reduce feature dimensionality and hence remove the noise or redundancy. Average strategy was adopted for these DTs to get final association scores between miRNAs and diseases. In model performance evaluation, EDTMDA showed AUC of 0.9309 in global leave-one-out cross validation (LOOCV) and AUC of 0.8524 in local LOOCV. Additionally, AUC of 0.9192+/-0.0009 in 5-fold cross validation proved the model’s reliability and stability. Furthermore, three types of case studies for four human diseases were implemented. As a result, 94% (Esophageal Neoplasms), 86% (Kidney Neoplasms), 96% (Breast Neoplasms) and 88% (Carcinoma Hepatocellular) of top 50 predicted miRNAs were confirmed by experimental evidences in literature.

Journal ArticleDOI
TL;DR: A novel electricity price forecasting model based on a hybrid feature selector based on Grey Correlation Analysis and an integration of Kernel function and Principle Component Analysis is used in feature extraction process to realize the dimensionality reduction.
Abstract: Electricity price forecasting is a significant part of smart grid because it makes smart grid cost efficient. Nevertheless, existing methods for price forecasting may be difficult to handle with huge price data in the grid, since the redundancy from feature selection cannot be averted and an integrated infrastructure is also lacked for coordinating the procedures in electricity price forecasting. To solve such a problem, a novel electricity price forecasting model is developed. Specifically, three modules are integrated in the proposed model. First, by merging of Random Forest (RF) and Relief-F algorithm, we propose a hybrid feature selector based on Grey Correlation Analysis (GCA) to eliminate the feature redundancy. Second, an integration of Kernel function and Principle Component Analysis (KPCA) is used in feature extraction process to realize the dimensionality reduction. Finally, to forecast price classification, we put forward a differential evolution (DE) based Support Vector Machine (SVM) classifier. Our proposed electricity price forecasting model is realized via these three parts. Numerical results show that our proposal has superior performance than other methods.

Journal ArticleDOI
TL;DR: Experiments conducted on three widely-used hyperspectral image datasets demonstrate that the dimension-reduced features learned by the proposed IMR framework with respect to classification or recognition accuracy are superior to those of related state-of-the-art HDR approaches.
Abstract: Hyperspectral dimensionality reduction (HDR), an important preprocessing step prior to high-level data analysis, has been garnering growing attention in the remote sensing community. Although a variety of methods, both unsupervised and supervised models, have been proposed for this task, yet the discriminative ability in feature representation still remains limited due to the lack of a powerful tool that effectively exploits the labeled and unlabeled data in the HDR process. A semi-supervised HDR approach, called iterative multitask regression (IMR), is proposed in this paper to address this need. IMR aims at learning a low-dimensional subspace by jointly considering the labeled and unlabeled data, and also bridging the learned subspace with two regression tasks: labels and pseudo-labels initialized by a given classifier. More significantly, IMR dynamically propagates the labels on a learnable graph and progressively refines pseudo-labels, yielding a well-conditioned feedback system. Experiments conducted on three widely-used hyperspectral image datasets demonstrate that the dimension-reduced features learned by the proposed IMR framework with respect to classification or recognition accuracy are superior to those of related state-of-the-art HDR approaches.

Journal ArticleDOI
TL;DR: An explicit form of suboptimal dimensionality reduction is given against DoS attacks, while an effective attack strategy is proposed for the attacker.
Abstract: This paper studies the distributed dimensionality reduction fusion estimation problem for a class of cyber-physical systems (CPSs) under denial-of-service (DoS) attacks The problem is modeled under the resource constraints (ie, bandwidth or energy) for the defender and attacker Based on a new attack and compensation model, a recursive distributed Kalman fusion estimator (DKFE) is designed for the addressed CPSs Though the optimization objects of the defender and attacker are opposite, the corresponding optimization problems are established based on different available information In this case, an explicit form of suboptimal dimensionality reduction is given against DoS attacks, while an effective attack strategy is proposed for the attacker A stability condition is derived such that the mean square error of the designed DKFE is bounded Two illustrative examples are given to show the effectiveness of the proposed methods

Posted ContentDOI
11 Mar 2019-bioRxiv
TL;DR: It is shown that simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance outperform current practice in a downstream clustering assessment using ground-truth datasets.
Abstract: Single cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization pro-cedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We pro-pose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.

Book ChapterDOI
01 Jan 2019
TL;DR: A novel binary version of whale optimization algorithm (bWOA) is proposed to select the optimal feature subset for dimensionality reduction and classifications problem based on a sigmoid transfer function (S-shape).
Abstract: Whale optimization algorithm is one of the recent nature-inspired optimization technique based on the behavior of bubble-net hunting strategy. In this paper, a novel binary version of whale optimization algorithm (bWOA) is proposed to select the optimal feature subset for dimensionality reduction and classifications problem. The new approach is based on a sigmoid transfer function (S-shape). By dealing with the feature selection problem, a free position of the whale must be transformed to their corresponding binary solutions. This transformation is performed by applying an S-shaped transfer function in every dimension that defines the probability of transforming the position vectors’ elements from 0 to 1 and vice versa and hence force the search agents to move in a binary space. K-NN classifier is applied to ensure that the selected features are the relevant ones. A set of criteria are used to evaluate and compare the proposed bWOA-S with the native one over eleven different datasets. The results proved that the new algorithm has a significant performance in finding the optimal feature.

Journal ArticleDOI
TL;DR: In this article, an unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms is considered. But the learned representation is tuned to contain only phonetic content, and the decoder is used to infer information discarded by the encoder from previous samples.
Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g.\ phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.

Journal ArticleDOI
TL;DR: This work compares 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes and provides important guidelines for choosing dimensionality Reduction methods for sc RNA-seq data analysis.
Abstract: Dimensionality reduction is an indispensable analytic component for many areas of single-cell RNA sequencing (scRNA-seq) data analysis. Proper dimensionality reduction can allow for effective noise removal and facilitate many downstream analyses that include cell clustering and lineage reconstruction. Unfortunately, despite the critical importance of dimensionality reduction in scRNA-seq analysis and the vast number of dimensionality reduction methods developed for scRNA-seq studies, few comprehensive comparison studies have been performed to evaluate the effectiveness of different dimensionality reduction methods in scRNA-seq. We aim to fill this critical knowledge gap by providing a comparative evaluation of a variety of commonly used dimensionality reduction methods for scRNA-seq studies. Specifically, we compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample sizes. We evaluate the performance of different dimensionality reduction methods for neighborhood preserving in terms of their ability to recover features of the original expression matrix, and for cell clustering and lineage reconstruction in terms of their accuracy and robustness. We also evaluate the computational scalability of different dimensionality reduction methods by recording their computational cost. Based on the comprehensive evaluation results, we provide important guidelines for choosing dimensionality reduction methods for scRNA-seq data analysis. We also provide all analysis scripts used in the present study at www.xzlab.org/reproduce.html.

Journal ArticleDOI
TL;DR: In this paper, the authors introduce a method to infer small sets of time-series features that exhibit strong classification performance across a given collection of time series problems, and are minimally redundant.
Abstract: Capturing the dynamical properties of time series concisely as interpretable feature vectors can enable efficient clustering and classification for time-series applications across science and industry. Selecting an appropriate feature-based representation of time series for a given application can be achieved through systematic comparison across a comprehensive time-series feature library, such as those in the hctsa toolbox. However, this approach is computationally expensive and involves evaluating many similar features, limiting the widespread adoption of feature-based representations of time series for real-world applications. In this work, we introduce a method to infer small sets of time-series features that (i) exhibit strong classification performance across a given collection of time-series problems, and (ii) are minimally redundant. Applying our method to a set of 93 time-series classification datasets (containing over 147,000 time series) and using a filtered version of the hctsa feature library (4791 features), we introduce a set of 22 CAnonical Time-series CHaracteristics, catch22, tailored to the dynamics typically encountered in time-series data-mining tasks. This dimensionality reduction, from 4791 to 22, is associated with an approximately 1000-fold reduction in computation time and near linear scaling with time-series length, despite an average reduction in classification accuracy of just 7%. catch22 captures a diverse and interpretable signature of time series in terms of their properties, including linear and non-linear autocorrelation, successive differences, value distributions and outliers, and fluctuation scaling properties. We provide an efficient implementation of catch22, accessible from many programming environments, that facilitates feature-based time-series analysis for scientific, industrial, financial and medical applications using a common language of interpretable time-series properties.

Proceedings Article
03 Apr 2019
TL;DR: This paper proposed an unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics using dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows to find core users that are representative of the different stances.
Abstract: We present a highly effective unsupervised framework for detecting the stance of prolific Twitter users with respect to controversial topics. In particular, we use dimensionality reduction to project users onto a low-dimensional space, followed by clustering, which allows us to find core users that are representative of the different stances. Our framework has three major advantages over pre-existing methods, which are based on supervised or semi-supervised classification. First, we do not require any prior labeling of users: instead, we create clusters, which are much easier to label manually afterwards, e.g., in a matter of seconds or minutes instead of hours. Second, there is no need for domain- or topic-level knowledge either to specify the relevant stances (labels) or to conduct the actual labeling. Third, our framework is robust in the face of data skewness, e.g., when some users or some stances have greater representation in the data. We experiment with different combinations of user similarity features, dataset sizes, dimensionality reduction methods, and clustering algorithms to ascertain the most effective and most computationally efficient combinations across three different datasets (in English and Turkish). We further verified our results on additional tweet sets covering six different controversial topics. Our best combination in terms of effectiveness and efficiency uses retweeted accounts as features, UMAP for dimensionality reduction, and Mean Shift for clustering, and yields a small number of high-quality user clusters, typically just 2–3, with more than 98% purity. The resulting user clusters can be used to train downstream classifiers. Moreover, our framework is robust to variations in the hyper-parameter values and also with respect to random initialization.

Journal ArticleDOI
TL;DR: A face expression recognition method based on a convolutional neural network (CNN) and an image edge detection and the dimensionality reduction of the extracted implicit features is processed by the maximum pooling method.
Abstract: To avoid the complex process of explicit feature extraction in traditional facial expression recognition, a face expression recognition method based on a convolutional neural network (CNN) and an image edge detection is proposed. Firstly, the facial expression image is normalized, and the edge of each layer of the image is extracted in the convolution process. The extracted edge information is superimposed on each feature image to preserve the edge structure information of the texture image. Then, the dimensionality reduction of the extracted implicit features is processed by the maximum pooling method. Finally, the expression of the test sample image is classified and recognized by using a Softmax classifier. To verify the robustness of this method for facial expression recognition under a complex background, a simulation experiment is designed by scientifically mixing the Fer-2013 facial expression database with the LFW data set. The experimental results show that the proposed algorithm can achieve an average recognition rate of 88.56% with fewer iterations, and the training speed on the training set is about 1.5 times faster than that on the contrast algorithm.

Journal ArticleDOI
TL;DR: The experimental results suggest that the proposed automated diagnostic system has the potential to classify PD patients from healthy subjects and in future the proposed method can also be exploited for prodromal and differential diagnosis, which are considered challenging tasks.
Abstract: Objective: Parkinson’s disease (PD) is a serious neurodegenerative disorder. It is reported that most of PD patients have voice impairments. But these voice impairments are not perceptible to common listeners. Therefore, different machine learning methods have been developed for automated PD detection. However, these methods either lack generalization and clinically significant classification performance or face the problem of subject overlap. Methods: To overcome the problems discussed above, we attempt to develop a hybrid intelligent system that can automatically perform acoustic analysis of voice signals in order to detect PD. The proposed intelligent system uses linear discriminant analysis (LDA) for dimensionality reduction and genetic algorithm (GA) for hyperparameters optimization of neural network (NN) which is used as a predictive model. Moreover, to avoid subject overlap, we use leave one subject out (LOSO) validation. Results: The proposed method namely LDA-NN-GA is evaluated in numerical experiments on multiple types of sustained phonations data in terms of accuracy, sensitivity, specificity, and Matthew correlation coefficient. It achieves classification accuracy of 95% on training database and 100% on testing database using all the extracted features. However, as the dataset is imbalanced in terms of gender, thus, to obtain unbiased results, we eliminated the gender dependent features and obtained accuracy of 80% for training database and 82.14% for testing database, which seems to be more unbiased results. Conclusion: Compared with the previous machine learning methods, the proposed LDA-NN-GA method shows better performance and lower complexity. Clinical Impact: The experimental results suggest that the proposed automated diagnostic system has the potential to classify PD patients from healthy subjects. Additionally, in future the proposed method can also be exploited for prodromal and differential diagnosis, which are considered challenging tasks.