
Showing papers on "Principal component analysis" published in 2020


Journal ArticleDOI
TL;DR: Two prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated with four popular Machine Learning (ML) algorithms on the publicly available Cardiotocography dataset from the University of California, Irvine Machine Learning Repository; the results show that PCA outperforms LDA on all measures.
Abstract: Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. Hence, they can be used to make predictions that help medical practitioners and people at the managerial level make executive decisions. Not all the attributes in the generated datasets are important for training the machine learning algorithms: some attributes might be irrelevant and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two of the prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine Machine Learning Repository. The experimental results show that PCA outperforms LDA in all the measures. Also, the performance of the Decision Tree and Random Forest classifiers is not affected much by using PCA or LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on the Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, it is observed that the ML algorithms without dimensionality reduction yield better results.
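
As a rough, hedged illustration of the comparison described above (not the authors' code), the following scikit-learn sketch evaluates the four classifiers with PCA and LDA preprocessing; the breast-cancer dataset stands in for the CTG data, and the component counts are arbitrary assumptions.

```python
# Hedged sketch: compare PCA vs. LDA preprocessing for four classifiers.
# The breast-cancer dataset is a stand-in for the CTG data; component
# counts and hyperparameters are illustrative, not the paper's settings.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "NaiveBayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
reducers = {
    "PCA": PCA(n_components=10),
    # LDA keeps at most (n_classes - 1) components; 1 for a binary problem.
    "LDA": LinearDiscriminantAnalysis(n_components=1),
}

for red_name, reducer in reducers.items():
    for clf_name, clf in classifiers.items():
        pipe = make_pipeline(StandardScaler(), reducer, clf)
        score = cross_val_score(pipe, X, y, cv=5).mean()
        print(f"{red_name} + {clf_name:<12}: {score:.3f}")
```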

414 citations


Journal ArticleDOI
TL;DR: In this article, a combination of principal component analysis (PCA) and convolutional neural networks (CNN) is used to predict the entire stress-strain behavior of binary composites evaluated over the entire failure path.

186 citations


Journal ArticleDOI
TL;DR: The experimental results showed that combining chi-square feature selection with PCA yields higher performance for most classifiers, whereas applying PCA directly to the raw data produced lower results and would require a larger number of dimensions to improve them.
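
A hedged sketch of the combination described in the TL;DR, as it might be wired up in scikit-learn (the digits dataset, k, and the component count are placeholder assumptions, not the paper's setup):

```python
# Hedged sketch of chi-square feature selection followed by PCA.
# Dataset, k, and n_components are illustrative placeholders.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # non-negative features, as chi2 requires

pipe = make_pipeline(
    SelectKBest(chi2, k=40),  # filter weakly relevant features first
    PCA(n_components=20),     # then decorrelate/compress the survivors
    SVC(),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```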

180 citations


Journal ArticleDOI
TL;DR: A deep feedforward method is developed to classify the given microarray cancer data into a set of classes for subsequent diagnosis, using a 7-layer deep neural network architecture with parameters chosen for each dataset.

138 citations


Journal ArticleDOI
TL;DR: This work aims at finding links between cognitive symptoms and the underlying neurodegeneration process by fusing the information of neuropsychological test outcomes, diagnoses, and other clinical data with the imaging features extracted solely via a data-driven decomposition of MRI.
Abstract: Many classical machine learning techniques have been used to explore Alzheimer's disease (AD), evolving from image decomposition techniques such as principal component analysis toward higher complexity, non-linear decomposition algorithms. With the arrival of the deep learning paradigm, it has become possible to extract high-level abstract features directly from MRI images that internally describe the distribution of data in low-dimensional manifolds. In this work, we try a new exploratory data analysis of AD based on deep convolutional autoencoders. We aim at finding links between cognitive symptoms and the underlying neurodegeneration process by fusing the information of neuropsychological test outcomes, diagnoses, and other clinical data with the imaging features extracted solely via a data-driven decomposition of MRI. The distribution of the extracted features in different combinations is then analyzed and visualized using regression and classification analysis, and the influence of each coordinate of the autoencoder manifold over the brain is estimated. The imaging-derived markers could then predict clinical variables with correlations above 0.6 in the case of neuropsychological evaluation variables such as the MMSE or the ADAS11 scores, achieving a classification accuracy over 80% for the diagnosis of AD.
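
As a loose illustration of the overall approach (a convolutional autoencoder whose latent coordinates are then related to clinical variables), here is a minimal PyTorch sketch; the architecture, image size, training length, and the synthetic stand-in data are all assumptions, not the paper's configuration.

```python
# Hedged PyTorch sketch: train a small convolutional autoencoder, then
# regress a clinical score on the latent coordinates. Synthetic 64x64
# "slices" and random scores stand in for MRI data and MMSE/ADAS11 values.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.linear_model import LinearRegression

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 16 * 16 * 16),
            nn.Unflatten(1, (16, 16, 16)),
            nn.ConvTranspose2d(16, 8, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

images = torch.rand(200, 1, 64, 64)        # stand-in for MRI slices
scores = torch.rand(200).numpy()           # stand-in for a neuropsychological score

model = ConvAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                         # a few epochs, purely illustrative
    recon, _ = model(images)
    loss = F.mse_loss(recon, images)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    _, latents = model(images)
# Relate the autoencoder manifold coordinates to the clinical variable.
reg = LinearRegression().fit(latents.numpy(), scores)
print("R^2 on training data:", reg.score(latents.numpy(), scores))
```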

124 citations


Journal ArticleDOI
TL;DR: A novel cross-attention mechanism and graph convolution integration algorithm is proposed that achieves better performance than other well-known algorithms under different training-set divisions, yielding improved hyperspectral data classification results.
Abstract: An attention mechanism assigns different weights to different features to help a model select the features most valuable for accurate classification. However, the traditional attention mechanism algorithm often allocates weights in a one-way fashion, which can result in a loss of feature information. To obtain better hyperspectral data classification results, a novel cross-attention mechanism and graph convolution integration algorithm are proposed in this letter. First, principal component analysis is used to reduce the dimensionality of hyperspectral images to obtain low-dimensional features that are more expressive. Second, the model uses a cross (horizontal and vertical directions) attention algorithm to allocate weights jointly based on its two strategies; then, it adopts a graph convolution algorithm to generate the directional relationships between the features. Finally, the generated deep features and the relationship between the deep features are used to complete the prediction of hyperspectral data. Experiments on three well-known hyperspectral data sets--Indian Pines, the University of Pavia, and Salinas--show that the proposed algorithm achieves better performances than do other well-known algorithms using different methods of training set division.
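
The PCA step described first in the abstract can be sketched as follows (a hedged illustration only; the random cube stands in for a real scene such as Indian Pines, and the component count is arbitrary):

```python
# Hedged sketch of the first step only: PCA dimensionality reduction of a
# hyperspectral cube. The random array stands in for a real scene
# (e.g. 145 x 145 pixels x 200 bands); n_components is arbitrary.
import numpy as np
from sklearn.decomposition import PCA

cube = np.random.rand(145, 145, 200)      # height x width x spectral bands
h, w, bands = cube.shape

pixels = cube.reshape(-1, bands)          # one row per pixel spectrum
pca = PCA(n_components=30)
reduced = pca.fit_transform(pixels)       # (h*w, 30) low-dimensional features
reduced_cube = reduced.reshape(h, w, -1)  # back to a spatial layout

print(reduced_cube.shape, pca.explained_variance_ratio_.sum())
```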

117 citations


Journal ArticleDOI
TL;DR: In this paper, an augmented geographically weighted regression (GWR) model was developed to analyze the spatial distribution of PM2.5 concentrations through the incorporation of Geodetector analysis and principal component analysis (PCA).

105 citations


Journal ArticleDOI
TL;DR: This work proposes a new formulation of logistic PCA which extends Pearson’s formulation of a low dimensional data representation with minimum error to binary data and derives explicit solutions for data matrices of special structure and provides a computationally efficient algorithm for solving for the principal component loadings.

99 citations


Journal ArticleDOI
TL;DR: Experimental results on several real hyperspectral data sets demonstrate that the proposed method outperforms other state-of-the-art methods.
Abstract: In this article, a novel hyperspectral anomaly detection method with kernel Isolation Forest (iForest) is proposed. The method is based on an assumption that anomalies rather than background can be more susceptible to isolation in the kernel space. Based on this idea, the proposed method detects anomalies as follows. First, the hyperspectral data are mapped into the kernel space, and the first K principal components are used. Then, the isolation samples in the image are detected with the iForest constructed using randomly selected samples in the principal components. Finally, the initial anomaly detection map is iteratively refined with locally constructed iForest in connected regions with large areas. Experimental results on several real hyperspectral data sets demonstrate that the proposed method outperforms other state-of-the-art methods.
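
A loose scikit-learn analogue of this pipeline (kernel mapping, leading principal components, isolation-based scoring) might look like the sketch below; it omits the iterative local refinement step and uses random data as a stand-in for a real scene.

```python
# Hedged sketch: kernel PCA projection followed by Isolation Forest scoring,
# loosely mirroring the detection pipeline above (the iterative refinement in
# connected regions is omitted). Random data stands in for a real scene.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import IsolationForest

cube = np.random.rand(50, 50, 50)                 # H x W x bands stand-in
pixels = cube.reshape(-1, cube.shape[-1])

# Map spectra into a kernel feature space and keep the first K components.
kpca = KernelPCA(n_components=20, kernel="rbf", gamma=0.1)
features = kpca.fit_transform(pixels)

# Isolation Forest built from randomly subsampled pixels; higher = more anomalous.
iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
iforest.fit(features)
anomaly_score = -iforest.score_samples(features)
detection_map = anomaly_score.reshape(50, 50)
print(detection_map.shape)
```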

95 citations


Journal ArticleDOI
TL;DR: This work successfully implements spatial PCA to reduce signal dimensionality and selects suitable features based on t-statistical inference among the classes, achieving a highly efficient brain-computer interface (BCI) system for emotion recognition from electroencephalogram signals.

86 citations


Journal ArticleDOI
TL;DR: This study provides a highly robust and accurate method for predicting and mapping regional SOC contents and indicates that at a low decomposition scale, DWT can effectively eliminate the noise in satellite hyperspectral data, and the FDR combined with DWT can improve the SOC prediction accuracy significantly.

Journal ArticleDOI
TL;DR: An advanced approach, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm, is introduced to the ink analysis problem; it extracts the non-linear similarity features between spectra to scale them into a lower dimension.

Journal ArticleDOI
TL;DR: It is found that gap-filling uncertainty is much larger than measurement uncertainty in the accumulated CH4 budget, and therefore the approach used for FCH4 gap filling can have important implications for characterizing annual ecosystem-scale methane budgets.
Abstract: Methane flux (FCH4) measurements using the eddy covariance technique have increased over the past decade. FCH4 measurements commonly include data gaps, as is the case with CO2 and energy fluxes. However, gap-filling FCH4 data is more challenging than for other fluxes due to FCH4's unique characteristics, including multidriver dependency, variabilities across multiple timescales, nonstationarity, spatial heterogeneity of flux footprints, and lagged influence of biophysical drivers. Some researchers have applied a marginal distribution sampling (MDS) algorithm, a standard gap-filling method for other fluxes, to FCH4 datasets, and others have applied artificial neural networks (ANN) to resolve the challenging characteristics of FCH4. However, there is still no consensus regarding FCH4 gap-filling methods due to limited comparative research. We are not aware of applications of machine learning (ML) algorithms beyond ANN to FCH4 datasets. Here, we compare the performance of MDS and three ML algorithms (ANN, random forest [RF], and support vector machine [SVM]) using multiple combinations of ancillary variables. In addition, we applied principal component analysis (PCA) as an input to the algorithms to address multidriver dependency of FCH4 and reduce the internal complexity of the algorithmic structures. We applied this approach to five benchmark FCH4 datasets from both natural and managed systems located in temperate and tropical wetlands and rice paddies. Results indicate that PCA improved the performance of MDS compared to traditional inputs. ML algorithms performed better when using all available biophysical variables compared to using PCA-derived inputs. Overall, RF was found to outperform other techniques for all sites. We found that gap-filling uncertainty is much larger than measurement uncertainty in the accumulated CH4 budget. Therefore, the approach used for FCH4 gap filling can have important implications for characterizing annual ecosystem-scale methane budgets, the accuracy of which is important for evaluating natural and managed systems and their interactions with global change processes.
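
One of the compared configurations (PCA-compressed drivers feeding a random forest, then predicting the gaps) can be sketched roughly as below; the synthetic drivers and flux series are placeholders, not the benchmark datasets.

```python
# Hedged sketch of one compared configuration: compress ancillary drivers
# with PCA, train a random forest on observed FCH4, and predict the gaps.
# Synthetic drivers/fluxes replace a real eddy-covariance record.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
drivers = rng.normal(size=(n, 8))       # e.g. temperature, radiation, water table...
fch4 = drivers @ rng.normal(size=8) + rng.normal(scale=0.5, size=n)
gaps = rng.random(n) < 0.3              # ~30% of flux values missing

scores = PCA(n_components=4).fit_transform(drivers)   # decorrelated driver inputs

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(scores[~gaps], fch4[~gaps])      # train on observed periods
filled = fch4.copy()
filled[gaps] = rf.predict(scores[gaps]) # fill the gaps
print("gap-filled budget:", filled.sum())
```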

Journal ArticleDOI
TL;DR: A robust and scalable SPCA algorithm is demonstrated by formulating it as a value-function optimization problem, which can further leverage randomized methods from linear algebra to extend the approach to the large-scale (big data) setting.
Abstract: Sparse principal component analysis (SPCA) has emerged as a powerful technique for modern data analysis, providing improved interpretation of low-rank structures by identifying localized spatial st...
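
For orientation, scikit-learn's SparsePCA gives a rough feel for the sparse, localized loadings that SPCA targets; note its solver is not the value-function/randomized formulation described above.

```python
# Hedged sketch: sparse principal components via scikit-learn. This solver
# differs from the paper's value-function formulation; it only illustrates
# the sparse loadings that make SPCA more interpretable than plain PCA.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)
print("nonzero loadings per sparse component:",
      (spca.components_ != 0).sum(axis=1))

pca = PCA(n_components=5).fit(X)
print("nonzero loadings per dense PCA component:",
      (np.abs(pca.components_) > 1e-12).sum(axis=1))
```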

Journal ArticleDOI
30 Mar 2020-Sensors
TL;DR: It is observed that the proposed combined path loss and shadowing model is more accurate and flexible compared to the conventional linear path loss plus log-normal shadowing model.
Abstract: Although various linear log-distance path loss models have been developed for wireless sensor networks, advanced models are required to more accurately and flexibly represent the path loss for complex environments. This paper proposes a machine learning framework for modeling path loss using a combination of three key techniques: artificial neural network (ANN)-based multi-dimensional regression, Gaussian process-based variance analysis, and principal component analysis (PCA)-aided feature selection. In general, the measured path loss dataset comprises multiple features such as distance, antenna height, etc. First, PCA is adopted to reduce the number of features of the dataset and simplify the learning model accordingly. ANN then learns the path loss structure from the dataset with reduced dimension, and Gaussian process learns the shadowing effect. Path loss data measured in a suburban area in Korea are employed. We observe that the proposed combined path loss and shadowing model is more accurate and flexible compared to the conventional linear path loss plus log-normal shadowing model.
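
A toy version of the three-stage framework (PCA-reduced features, an ANN for the mean path loss, a Gaussian process for the shadowing residual) might be assembled as follows; the synthetic measurements and all hyperparameters are assumptions, not the paper's setup.

```python
# Hedged toy version of the three-stage framework: PCA feature reduction,
# ANN regression for the mean path loss, and a Gaussian process for the
# shadowing residual. Synthetic data replaces the suburban measurements.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
features = rng.normal(size=(n, 6))       # e.g. log-distance, antenna heights, ...
path_loss = 40 + 20 * features[:, 0] + rng.normal(scale=4, size=n)  # dB

X = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(features))

ann = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
ann.fit(X, path_loss)                    # learns the mean path loss structure

residual = path_loss - ann.predict(X)    # remaining shadowing component
gp = GaussianProcessRegressor(alpha=1.0).fit(X[:200], residual[:200])
mean_shadow, std_shadow = gp.predict(X[:5], return_std=True)
print(std_shadow)                        # location-dependent shadowing spread
```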

Journal ArticleDOI
TL;DR: A benchmark shows that some PCA algorithms based on Krylov subspace methods and randomized singular value decomposition are fast, memory-efficient, and more accurate than the other algorithms, and a guideline is developed to select an appropriate PCA implementation based on differences in the computational environments of users and developers.
Abstract: Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but for large-scale scRNA-seq datasets, computation time is long and consumes large amounts of memory. In this work, we review the existing fast and memory-efficient PCA algorithms and implementations and evaluate their practical application to large-scale scRNA-seq datasets. Our benchmark shows that some PCA algorithms based on Krylov subspace and randomized singular value decomposition are fast, memory-efficient, and more accurate than the other algorithms. We develop a guideline to select an appropriate PCA implementation based on the differences in the computational environment of users and developers.
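
In practice, a randomized-SVD reduction of the kind favored by this benchmark can be run on a large sparse expression matrix with scikit-learn; the random count matrix below is a stand-in for a real cells-by-genes dataset, and TruncatedSVD skips centering so that the matrix can stay sparse.

```python
# Hedged sketch: randomized-SVD dimensionality reduction on a large sparse
# count matrix, the class of algorithm the benchmark finds fast and
# memory-efficient. The random matrix stands in for real scRNA-seq data.
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

counts = sp.random(50_000, 2_000, density=0.05, random_state=0, format="csr")
counts.data = np.log1p(counts.data)      # crude log-normalization of nonzeros

# TruncatedSVD with the randomized solver works directly on sparse input
# (no dense copy, no centering), unlike an exact full PCA.
svd = TruncatedSVD(n_components=50, algorithm="randomized", random_state=0)
cell_embedding = svd.fit_transform(counts)
print(cell_embedding.shape, svd.explained_variance_ratio_[:5])
```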

Journal ArticleDOI
TL;DR: In this article, principal component analysis (PCA) was applied to angle-resolved XPS spectra for thermally oxidized nickel titanium alloy to determine the ratio of various oxides in the mixture.

Journal ArticleDOI
TL;DR: A machine learning framework to explore the predictability limits of catalytic activity from experimental descriptor data (which characterizes catalyst formulations and reaction conditions) is presented.
Abstract: We present a machine learning framework to explore the predictability limits of catalytic activity from experimental descriptor data (which characterizes catalyst formulations and reaction conditions). Artificial neural networks are used to fuse descriptor data to predict activity and we use principal component analysis (PCA) and sparse PCA to project the experimental data into an information space and with this identify regions that exhibit low- and high-predictability. Our framework also incorporates a constrained-PCA optimization formulation that identifies new experimental points while filtering out regions in the experimental space due to constraints on technology, economics, and expert knowledge. This allows us to navigate the experimental space in a more targeted manner. Our framework is applied to a comprehensive water–gas shift reaction data set, which contains 2228 experimental data points collected from the literature. Neural network analysis reveals strong predictability of activity across reaction conditions (e.g., varying temperature) but also reveals important gaps in predictability across catalyst formulations (e.g., varying metal, support, and promoter). PCA analysis reveals that these gaps are due to the fact that most experiments reported in the literature lie within narrow regions in the information space. We demonstrate that our framework can systematically guide experiments and the selection of descriptors in order to improve predictability and identify new promising formulations.

Journal ArticleDOI
TL;DR: The objectives of this study were to compare the performance of a nonlinear dimensionality reduction technique with that of a standard linear method for single-subject EMG-based hand movement classification, and to examine their performance with a limited number of training samples.
Abstract: Surface electromyography (EMG) is a non-invasive signal acquisition technique that plays a central role in many applications, including clinical diagnostics, control of prosthetic devices, and human-machine interaction. The processing typically begins with a feature extraction step, which may be followed by the application of a dimensionality reduction technique. The obtained reduced features are the input for a machine learning classifier. The constructed machine learning model may then classify new recorded movements. The features extracted from EMG signals usually capture information from both the time and the frequency domain. The short-time Fourier transform (STFT) is commonly used for signal processing, and in particular for EMG processing, since it captures the temporal and the frequency characteristics of the data. Since the number of calculated STFT features is large, a common approach in signal processing and machine learning applications is to apply a linear or a nonlinear dimensionality reduction technique to simplify the feature space. Another aspect that arises in medical applications in general, and in EMG-based hand movement classification in particular, is the large variability between subjects. Due to this variability, many studies focus on single-subject classification. This requires acquiring a large training set for each tested participant, which is not practical in real-life applications. The objectives of this study were, first, to compare the performance of a nonlinear dimensionality reduction technique with that of a standard linear method for single-subject EMG-based hand movement classification, and to examine their performance with a limited number of training samples. The second objective was to propose an algorithm for multi-subject classification that utilizes a data alignment step to overcome the large variability between subjects. The data set included EMG signals from 5 subjects who performed 6 different hand movements. The STFT was calculated for feature extraction, and principal component analysis (PCA) and diffusion maps (DM) were compared for dimensionality reduction. An affine transformation for aligning the reduced feature spaces of two subjects was investigated. K-nearest neighbors (KNN) was used for single- and multi-subject classification. The results of this study clearly show that DM outperformed PCA in the case of limited training data. In addition, the multi-subject classification approach, which utilizes dimensionality reduction methods along with an alignment algorithm, enables robust classification of a new subject based on other subjects' data sets. The proposed framework is general and can be adopted for many EMG classification tasks.
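
The single-subject pipeline (STFT features, linear dimensionality reduction, KNN) reduces to a few lines in the sketch below; synthetic signals replace the recorded EMG, and a diffusion-maps embedding would substitute for PCA in the nonlinear variant compared in the paper.

```python
# Hedged sketch of the single-subject pipeline: STFT magnitude features per
# trial, PCA for dimensionality reduction, KNN classification. Synthetic
# signals stand in for real EMG; diffusion maps would replace PCA in the
# nonlinear variant.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
fs = 1000                                        # sampling rate (Hz), assumed
n_trials, n_samples = 120, 2000
signals = rng.normal(size=(n_trials, n_samples)) # fake EMG trials
labels = rng.integers(0, 6, size=n_trials)       # 6 hand movements

def stft_features(trial):
    _, _, Z = stft(trial, fs=fs, nperseg=256)
    return np.abs(Z).ravel()                     # magnitude spectrogram as features

X = np.array([stft_features(s) for s in signals])

pipe = make_pipeline(StandardScaler(), PCA(n_components=20), KNeighborsClassifier(5))
print(cross_val_score(pipe, X, labels, cv=5).mean())   # ~chance on fake data
```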

Journal ArticleDOI
TL;DR: In this paper, the authors developed an estimator for latent factors in a large-dimensional panel of financial data that can explain expected excess returns. Their estimator can find weak asset-pricing factors, which cannot be detected with PCA, even if a large amount of data is available.

Journal ArticleDOI
TL;DR: A novel dynamic weight principal component analysis (DWPCA) algorithm and a hierarchical monitoring strategy are proposed to further increase the fault detection rate while preserving the universality of the algorithm.
Abstract: Traditional monitoring algorithms use only normal data for modeling, which makes them universal for different types of faults. However, these algorithms may sometimes perform poorly because of the lack of fault information. In order to further increase the fault detection rate while preserving the universality of the algorithm, a novel dynamic weight principal component analysis (DWPCA) algorithm and a hierarchical monitoring strategy are proposed. In the first layer, dynamic PCA is used for fault detection and diagnosis; if no fault is detected, the DWPCA-based second-layer monitoring is triggered. In the second layer, the principal components (PCs) are weighted according to their ability to distinguish between normal and fault conditions, and the PCs with larger weights are selected to construct the monitoring model. Compared to the DPCA method, the proposed DWPCA algorithm establishes the monitoring model by incorporating fault information. Afterward, the DWPCA-based variable relative contribution and a novel control limit for the variable relative contribution are presented for fault diagnosis. Finally, the superiority of the proposed method is demonstrated on a numerical case and the Tennessee Eastman process.
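
A heavily simplified illustration of the second-layer idea (weighting principal components by how well they separate normal and faulty data, then monitoring with a T²-like statistic on the selected PCs) is given below; it is not the authors' DWPCA algorithm, and the weights and control limit are ad hoc.

```python
# Hedged, heavily simplified illustration: score each principal component by
# how well it separates normal and faulty data, keep the highest-weighted
# PCs, and monitor with a T^2-like statistic. Not the authors' DWPCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 10))
fault = rng.normal(size=(200, 10))
fault[:, 2] += 2.0                      # fault shifts one underlying variable

pca = PCA().fit(normal)
t_normal = pca.transform(normal)
t_fault = pca.transform(fault)

# Weight = Fisher-like separation of the score distributions per PC.
weights = (t_normal.mean(0) - t_fault.mean(0)) ** 2 / (
    t_normal.var(0) + t_fault.var(0))
keep = np.argsort(weights)[::-1][:3]    # PCs most sensitive to this fault

lam = pca.explained_variance_[keep]
t2 = lambda scores: np.sum(scores[:, keep] ** 2 / lam, axis=1)
limit = np.quantile(t2(t_normal), 0.99) # empirical control limit
print("fault detection rate:", np.mean(t2(t_fault) > limit))
```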

Journal ArticleDOI
TL;DR: This paper explores dimensionality reduction on a real telecom dataset and evaluates customer clustering in the reduced and latent spaces, compared to the original space, in order to achieve better-quality clustering results.
Abstract: Telecom companies log customer actions, which generates a huge amount of data that can yield important findings related to customer behavior and needs. The main characteristics of such data are the large number of features and the high sparsity, which impose challenges on the analytics steps. This paper aims to explore dimensionality reduction on a real telecom dataset and to evaluate customer clustering in the reduced and latent spaces, compared to the original space, in order to achieve better-quality clustering results. The original dataset contains 220 features belonging to 100,000 customers. Dimensionality reduction is an important data preprocessing step in the data mining process, especially in the presence of the curse of dimensionality. In particular, the aim of data reduction techniques is to filter out irrelevant features and noisy data samples. To reduce the high-dimensional data, we projected it down to a subspace using the well-known Principal Component Analysis (PCA) decomposition and a novel approach based on an autoencoder neural network, performing in this way dimensionality reduction of the original data. Then K-Means clustering is applied to both the original and the reduced data sets. Different internal measures were computed to evaluate the clustering for different numbers of dimensions, and we then evaluated how the reduction method impacts the clustering task.
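
The PCA branch of that evaluation loop could look like the sketch below; an autoencoder bottleneck would replace PCA for the latent-space variant, and the random matrix stands in for the 220-feature customer dataset.

```python
# Hedged sketch of the PCA branch: reduce the customer matrix to several
# dimensionalities, cluster with K-Means, compare silhouette scores. An
# autoencoder bottleneck would supply the latent-space variant; random
# sparse-ish counts stand in for the real 220-feature dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
customers = rng.poisson(0.2, size=(5000, 220)).astype(float)
X = StandardScaler().fit_transform(customers)

for n_dim in (220, 50, 10):             # original space vs. reduced spaces
    Z = X if n_dim == 220 else PCA(n_components=n_dim).fit_transform(X)
    labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(Z)
    score = silhouette_score(Z, labels, sample_size=2000, random_state=0)
    print(n_dim, "dims -> silhouette:", round(score, 3))
```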

Journal ArticleDOI
TL;DR: This paper combines multi-strategy feature selection and grouped feature extraction into a novel fast hybrid dimension reduction method that incorporates their respective advantages in removing irrelevant and redundant information, reducing the dimensionality of the raw data quickly.
Abstract: Dimensionality reduction is a basic and critical technology for data mining, especially in the current "big data" era. As two different types of methods, feature selection and feature extraction each have their pros and cons. In this paper, we combine multi-strategy feature selection with grouped feature extraction and propose a novel fast hybrid dimension reduction method, incorporating their advantages in removing irrelevant and redundant information. Firstly, the intrinsic dimensionality of the data set is estimated by the maximum likelihood estimation method. Fisher Score and Information Gain based feature selection are used as the multi-strategy methods to remove irrelevant features. With the redundancy among the selected features as the clustering criterion, the features are grouped into a certain number of clusters. In every cluster, Principal Component Analysis (PCA) based feature extraction is carried out to remove redundant information. Four classical classifiers and representation entropy are used to evaluate the classification performance and information loss of the reduced set. The runtime results of different methods show that the proposed hybrid method is consistently much faster than the other three on almost all of the sets used. Meanwhile, the proposed method shows competitive classification performance, with essentially no significant difference from the other methods. The proposed method reduces the dimensionality of the raw data quickly and has excellent efficiency and competitive classification performance compared with the contrastive methods.
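
A rough sketch of the pipeline's main stages under simplified choices is shown below: mutual information stands in for the Fisher-score/Information-Gain pair, correlation distance drives the redundancy grouping, and the intrinsic-dimension estimate is fixed by hand rather than by maximum likelihood.

```python
# Hedged sketch of the main stages with simplified stand-ins: relevance
# filtering, redundancy-based grouping of features, and per-group PCA.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# 1) Remove irrelevant features with a relevance filter.
selector = SelectKBest(mutual_info_classif, k=20).fit(X, y)
X_sel = selector.transform(X)

# 2) Group redundant features: hierarchical clustering on correlation distance.
corr = np.corrcoef(X_sel, rowvar=False)
dist = 1 - np.abs(corr)
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
groups = fcluster(Z, t=5, criterion="maxclust")   # 5 groups, chosen by hand

# 3) Within each group, keep one principal component to drop redundancy.
parts = [PCA(n_components=1).fit_transform(X_sel[:, groups == g])
         for g in np.unique(groups)]
X_reduced = np.hstack(parts)
print(X_reduced.shape)                            # (n_samples, number of groups)
```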

Posted ContentDOI
25 Aug 2020-bioRxiv
TL;DR: A generative model is developed to simulate synthetic datasets with multivariate associations and is used to characterize how the obtained feature profiles can be unstable, hindering interpretability and generalizability, unless a sufficient number of samples is available to estimate them.
Abstract: Associations between high-dimensional datasets, each comprising many features, can be discovered through multivariate statistical methods, like Canonical Correlation Analysis (CCA) or Partial Least Squares (PLS). CCA and PLS are widely used methods which reveal which features carry the association. Despite the longevity and popularity of CCA/PLS approaches, their application to high-dimensional datasets raises critical questions about the reliability of CCA/PLS solutions. In particular, overfitting can produce solutions that are not stable across datasets, which severely hinders their interpretability and generalizability. To study these issues, we developed a generative model to simulate synthetic datasets with multivariate associations, parameterized by feature dimensionality, data variance structure, and assumed latent association strength. We found that resulting CCA/PLS associations could be highly inaccurate when the number of samples per feature is relatively small. For PLS, the profiles of feature weights exhibit detrimental bias toward leading principal component axes. We confirmed these model trends in state-of-the-art datasets containing neuroimaging and behavioral measurements in large numbers of subjects, namely the Human Connectome Project (n ≈ 1000) and UK Biobank (n = 20000), where we found that only the latter comprised enough samples to obtain stable estimates. Analysis of the neuroimaging literature using CCA to map brain-behavior relationships revealed that the commonly employed sample sizes yield unstable CCA solutions. Our generative modeling framework provides a calculator of dataset properties required for stable estimates. Collectively, our study characterizes dataset properties needed to limit the potentially detrimental effects of overfitting on stability of CCA/PLS solutions, and provides practical recommendations for future studies. Significance Statement: Scientific studies often begin with an observed association between different types of measures. When datasets comprise large numbers of features, multivariate approaches such as canonical correlation analysis (CCA) and partial least squares (PLS) are often used. These methods can reveal the profiles of features that carry the optimal association. We developed a generative model to simulate data, and characterized how obtained feature profiles can be unstable, which hinders interpretability and generalizability, unless a sufficient number of samples is available to estimate them. We determine sufficient sample sizes, depending on properties of datasets. We also show that these issues arise in neuroimaging studies of brain-behavior relationships. We provide practical guidelines and computational tools for future CCA and PLS studies.
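
The core stability check can be mimicked in a few lines: fit CCA on two independent samples at each sample size and compare the estimated weight vectors. The simulation below is far simpler than the paper's generative model and is only meant to show the qualitative effect of sample size.

```python
# Hedged sketch of the stability check: at each sample size, fit CCA on two
# independent draws from the same model and correlate the fitted weights.
# The simulation is much simpler than the paper's generative model.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
p, q = 50, 50
wx_true = rng.normal(size=(1, p))        # fixed loading patterns
wy_true = rng.normal(size=(1, q))

def simulate(n):
    latent = rng.normal(size=(n, 1))     # one shared latent variable
    X = latent @ wx_true + rng.normal(size=(n, p))
    Y = latent @ wy_true + rng.normal(size=(n, q))
    return X, Y

def fitted_weights(n):
    X, Y = simulate(n)
    return CCA(n_components=1).fit(X, Y).x_weights_.ravel()

for n in (100, 1000, 10000):
    w1, w2 = fitted_weights(n), fitted_weights(n)
    stability = abs(np.corrcoef(w1, w2)[0, 1])   # near 1 = reproducible weights
    print(f"n={n:>5}: weight similarity {stability:.2f}")
```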

Journal ArticleDOI
TL;DR: A new method about the multi-fault condition monitoring of slurry pump based on principal component analysis (PCA) and sequential probability ratio test (SPRT) is proposed.
Abstract: A new method about the multi-fault condition monitoring of slurry pump based on principal component analysis (PCA) and sequential probability ratio test (SPRT) is proposed. The method identifies th...

Journal ArticleDOI
TL;DR: The objective of this review is to demonstrate the analytical performance of High Resolution Mass Spectrometry (HRMS) in the field of food authenticity assessment, allowing the determination of a wide range of food constituents with exceptional identification capabilities.

Journal ArticleDOI
TL;DR: This study provides a novel methodology to predict monthly water demand based on several weather variables scenarios by using combined techniques including discrete wavelet transform, principal component analysis, and particle swarm optimisation.
Abstract: This study provides a novel methodology to predict monthly water demand based on several weather-variable scenarios by using combined techniques including the discrete wavelet transform, principal component analysis, and particle swarm optimisation. To our knowledge, the adopted approach is the first such technique to be proposed and applied to water demand prediction. Compared to traditional methods, the developed methodology is superior in terms of predictive accuracy and runtime. Water consumption data coupled with weather variables for the City of Melbourne, from 2006 to 2015, were obtained from the South East Water retail company. The results showed that using data pre-processing techniques can significantly improve the quality of the data and help select the best model input scenario. Additionally, it was noticed that the particle swarm optimisation algorithm accurately estimates the constants of the suggested model. Furthermore, the results confirmed that the proposed methodology accurately estimated the monthly municipal water demand based on a range of statistical criteria.

Journal ArticleDOI
TL;DR: A novel reformulation of L1-norm kernel PCA is provided through which an equivalent, geometrically interpretable problem is obtained and a “fixed-point” type algorithm that iteratively computes a binary weight for each observation is presented.
Abstract: We present an algorithm for L1-norm kernel PCA and provide a convergence analysis for it. While an optimal solution of L2-norm kernel PCA can be obtained through matrix decomposition, finding that of L1-norm kernel PCA is not trivial due to its non-convexity and non-smoothness. We provide a novel reformulation through which an equivalent, geometrically interpretable problem is obtained. Based on the geometric interpretation of the reformulated problem, we present a "fixed-point" type algorithm that iteratively computes a binary weight for each observation. As the algorithm requires only inner products of data vectors, it is computationally efficient and the kernel trick is applicable. In the convergence analysis, we show that the algorithm converges to a local optimal solution in a finite number of steps. Moreover, we provide a rate of convergence analysis, which has never been done for any L1-norm PCA algorithm, proving that the sequence of objective values converges at a linear rate. In numerical experiments, we show that the algorithm is robust in the presence of entry-wise perturbations and computationally scalable, especially in a large-scale setting. Lastly, we introduce an application to outlier detection where the model based on the proposed algorithm outperforms the benchmark algorithms.
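
For the linear (non-kernel) case, this style of fixed-point iteration, which assigns a binary weight sign(w·x_i) to each observation at every step, reduces to a few lines; the sketch below is a generic L1-PCA iteration under that interpretation, not the paper's kernelized algorithm.

```python
# Hedged sketch of the linear case: a fixed-point iteration that assigns a
# binary weight sign(w.x_i) per observation and converges to a local
# maximizer of the L1 dispersion ||Xw||_1 subject to ||w|| = 1. This is a
# generic L1-PCA iteration, not the paper's kernel algorithm.
import numpy as np

def l1_pca_direction(X, n_iter=100, seed=0):
    """First L1 principal direction of (robustly centered) X, shape (n, d)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        signs = np.sign(X @ w)            # binary weight per observation
        signs[signs == 0] = 1.0
        w_new = X.T @ signs
        w_new /= np.linalg.norm(w_new)
        if np.allclose(w_new, w):         # fixed point reached
            break
        w = w_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:5] += 20                               # a few gross outliers
X -= np.median(X, axis=0)                 # robust centering
print(l1_pca_direction(X))
```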

Journal ArticleDOI
TL;DR: The HPCA model markedly reduces the interference of redundant information and effectively separates the salient object from the background, and it achieves higher detection accuracy than other methods.
Abstract: Aiming at the problems of intensive background noise, low accuracy, and high computational complexity in current salient object detection methods, a visual saliency detection algorithm based on Hierarchical Principal Component Analysis (HPCA) is proposed in this paper. Firstly, the original RGB image is converted to a grayscale image, and the grayscale image is divided into eight layers by a bit-plane stratification technique. Each image layer contains salient object information matching that layer's image features. Secondly, taking the color structure of the original image as the reference image, the grayscale image is reassigned by a grayscale-to-color conversion method, so that the layered image not only reflects the original structural features but also effectively preserves the color features of the original image. Thirdly, Principal Component Analysis (PCA) is performed on the layered image to obtain the structural difference characteristics and color difference characteristics of each layer in the principal component direction. Fourthly, the two features are integrated to get a saliency map with high robustness; to further refine the results, known priors on image organization are incorporated, which place the subject of the photograph near the center of the image. Finally, an entropy calculation is used to determine the optimal image from the layered saliency maps; the optimal map has the least background information and the most prominent salient objects. The object detection results of the proposed model are closer to the ground truth and show advantages in performance measures including precision rate (PRE), recall rate (REC), and F-measure (FME). The HPCA model markedly reduces the interference of redundant information and effectively separates the salient object from the background. At the same time, it achieves higher detection accuracy than other methods.
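
Two of the early steps (bit-plane stratification into eight layers and a per-layer PCA) can be sketched as below; the feature fusion, center prior, and entropy-based selection described above are omitted, and the random image is a placeholder.

```python
# Hedged sketch of two early steps only: split a grayscale image into eight
# bit-plane layers and run PCA on each layer's 8x8 patch statistics. The
# fusion, priors, and entropy selection are omitted; the image is random.
import numpy as np
from sklearn.decomposition import PCA

gray = (np.random.rand(128, 128) * 255).astype(np.uint8)     # stand-in image

# Eight bit-plane layers: layer k keeps bit k of every pixel.
layers = [((gray >> k) & 1).astype(float) for k in range(8)]

for k, layer in enumerate(layers):
    patches = layer.reshape(16, 8, 16, 8).transpose(0, 2, 1, 3).reshape(-1, 64)
    pc1 = PCA(n_components=1).fit(patches)
    print(f"bit plane {k}: explained variance {pc1.explained_variance_ratio_[0]:.2f}")
```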

Journal ArticleDOI
TL;DR: The experiments on two hyperspectral data sets show that the LSSTRPCA can successfully remove outliers or gross errors and achieve higher accuracies than both the original robust principal component analysis (RPCA) and tensor robust principal component analysis (TRPCA).
Abstract: This letter proposes a lateral-slice sparse tensor robust principal component analysis (LSSTRPCA) method to remove gross errors or outliers from hyperspectral images so as to promote the performance of subsequent classification. The LSSTRPCA assumes that a third-order hyperspectral tensor has a low-rank structure, and gross errors or outliers are sparsely scattered in a 2-D space (i.e., lateral slice) of the tensor. It formulates a low-rank and sparse tensor decomposition problem as a convex problem and then implements the inexact augmented Lagrange multiplier method to solve it. The experiments on two hyperspectral data sets show that the LSSTRPCA can successfully remove outliers or gross errors and achieve higher accuracies than both the original robust principal component analysis (RPCA) and tensor robust principal component analysis (TRPCA).
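
For orientation, the matrix (non-tensor) baseline that LSSTRPCA generalizes, a low-rank plus sparse decomposition solved by an inexact augmented Lagrange multiplier scheme, can be written compactly; the sketch below is that generic matrix RPCA, not the lateral-slice tensor method.

```python
# Hedged sketch of the matrix baseline: robust PCA via an inexact augmented
# Lagrange multiplier scheme, splitting M into low-rank L plus sparse S.
# The paper's LSSTRPCA generalizes this to third-order tensors with
# lateral-slice sparsity.
import numpy as np

def rpca_ialm(M, lam=None, rho=1.5, n_iter=200, tol=1e-7):
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)
    S = np.zeros_like(M)
    Y = M / max(np.linalg.norm(M, 2), np.abs(M).max() / lam)
    for _ in range(n_iter):
        # Singular-value thresholding for the low-rank part.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0)) @ Vt
        # Soft thresholding for the sparse part.
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)
        Y = Y + mu * (M - L - S)
        mu *= rho
        if np.linalg.norm(M - L - S) / norm_M < tol:
            break
    return L, S

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))
sparse = (rng.random((100, 80)) < 0.05) * rng.normal(scale=10, size=(100, 80))
L, S = rpca_ialm(low_rank + sparse)
print(np.linalg.matrix_rank(L, tol=1e-3), int((np.abs(S) > 1e-6).sum()))
```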