
Showing papers on "Principal component analysis published in 2017"


Journal ArticleDOI
TL;DR: PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), with the goal of finding the best summary of the data using a limited number of PCs; PCA minimizes the perpendicular distance between a data point and the principal component, whereas linear regression minimizes the distance between the response variable and its predicted value.
Abstract: Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features. High-dimensional data are very common in biology and arise when multiple features, such as expression of many genes, are measured for each sample. This type of data presents several challenges that PCA mitigates: computational expense and an increased error rate due to multiple test correction when testing each feature for association with an outcome. PCA is an unsupervised learning method and is similar to clustering (ref. 1)—it finds patterns without reference to prior knowledge about whether the samples come from different treatment groups or have phenotypic differences. PCA reduces data by geometrically projecting them onto lower dimensions called principal components (PCs), with the goal of finding the best summary of the data using a limited number of PCs. The first PC is chosen to minimize the total distance between the data and their projection onto the PC (Fig. 1a). By minimizing this distance, we also maximize the variance of the projected points, σ² (Fig. 1b). The second (and subsequent) PCs are selected similarly, with the additional requirement that they be uncorrelated with all previous PCs. For example, projection onto PC1 is uncorrelated with projection onto PC2, and we can think of the PCs as geometrically orthogonal. This requirement of no correlation means that the maximum number of PCs possible is either the number of samples or the number of features, whichever is smaller. The PC selection process has the effect of maximizing the correlation (r²) (ref. 2) between data and their projection and is equivalent to carrying out multiple linear regression (refs. 3,4) on the projected data against each variable of the original data. For example, the projection onto PC2 has maximum r² when used in multiple regression with PC1. The PCs are defined as a linear combination of the data’s original variables, and in our two-dimensional (2D) example, PC1 = x/√2 + y/√2 (Fig. 1c). These coefficients are stored in a ‘PCA loading matrix’, which can be interpreted as a rotation matrix that rotates data such that the projection with greatest variance goes along the first axis. At first glance, PC1 closely resembles the linear regression line (ref. 3) of y versus x or x versus y (Fig. 1c). However, PCA differs from linear regression in that PCA minimizes the perpendicular distance between a data point and the principal component, whereas linear regression minimizes the distance between the response variable and its predicted value. To illustrate PCA on biological data, we simulated expression profiles for nine genes that fall into one of three patterns across six samples (Fig. 2a). We find that the variance is fairly similar across samples (Fig. 2a), which tells us that no single sample captures the patterns in the data appreciably more than another. In other words, we need all six sample dimensions to express the data fully. Let’s now use PCA to see whether a smaller number of combinations of samples can capture the patterns. We start by finding the six PCs (PC1–PC6), which become our new axes (Fig. 2b). We next transform the profiles so that they are expressed as linear combinations of PCs—each profile is now a set of coordinates on the PC axes—and calculate the variance (Fig. 2c). As expected, PC1 has the largest variance, with 52.6% captured by PC1 and 47.0% captured by PC2.
A useful interpretation of PCA is that r² of the regression is the percent variance (of all the data) explained by the PCs. As additional PCs are added to the prediction, the difference in r² corresponds to the variance explained by that PC. However, all the PCs are not typically used because the majority of variance, and hence patterns in the data, will be limited to the first few PCs. In our example, we can ignore PC3−PC6, which contribute little (0.4%) to explaining the variance, and express the data in two dimensions instead of six. Figure 2d verifies visually that we can faithfully reproduce the profiles using only PC1 and PC2. For example, the root mean square (r.m.s.) distances of the original profile A from its 1D, 2D and 3D reconstructions are 0.29, 0.03 and 0.01, respectively. Approximations using two or three PCs are useful.
[Figure 2 | PCA reduction of nine expression profiles from six to two dimensions. (a) Expression profiles for nine genes (A–I) across six samples (a−f), coded by color on the basis of shape similarity, and the expression variance of each sample. (b) PC1–PC6 of the profiles in a. PC1 and PC2 reflect clearly visible trends, and the remaining capture only small fluctuations. (c) Transformed profiles, expressed as PC scores and σ² of each component score. (d) The profiles reconstructed using PC1–PC3. (e) The 2D coordinates of each profile based on the scores of the first two PCs.]
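
For readers who want to see these steps concretely, here is a minimal numpy sketch (not the authors' code) of the procedure the abstract describes: center the data, obtain the PCs via SVD, project the profiles onto them, read off the variance explained, and reconstruct the profiles from the first few PCs. The 9×6 random matrix merely stands in for the simulated expression profiles.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 6))            # stand-in for 9 gene profiles x 6 samples
Xc = X - X.mean(axis=0)                # center each sample (column)

# SVD: rows of Vt are the PCs (the loading matrix), ordered by variance
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                     # each profile as coordinates on the PC axes

explained = s ** 2 / np.sum(s ** 2)    # fraction of total variance per PC
print("variance explained:", np.round(explained, 3))

# Reconstruct the profiles from the first k PCs and measure the r.m.s. error
k = 2
X_hat = scores[:, :k] @ Vt[:k, :] + X.mean(axis=0)
rms = np.sqrt(np.mean((X - X_hat) ** 2, axis=1))
print("per-profile r.m.s. reconstruction error:", np.round(rms, 3))
```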

700 citations


Journal ArticleDOI
TL;DR: Unsupervised machine learning techniques to learn features that best describe configurations of the two-dimensional Ising model and the three-dimensional XY model are examined, finding that the most promising algorithms are principal component analysis and variational autoencoders.
Abstract: We examine unsupervised machine learning techniques to learn features that best describe configurations of the two-dimensional Ising model and the three-dimensional XY model. The methods range from principal component analysis, through manifold and clustering methods, to artificial neural-network-based variational autoencoders. They are applied to Monte Carlo-sampled configurations and have, a priori, no knowledge about the Hamiltonian or the order parameter. We find that the most promising algorithms are principal component analysis and variational autoencoders. Their predicted latent parameters correspond to the known order parameters. The latent representations of the models in question are clustered, which makes it possible to identify phases without prior knowledge of their existence. Furthermore, we find that the reconstruction loss function can be used as a universal identifier for phase transitions.
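
As a rough illustration of the idea (our own sketch, not the paper's code), PCA applied to raw spin configurations yields a leading component that tracks the magnetization, i.e., the known order parameter; the toy "ordered" and "disordered" samples below stand in for Monte Carlo data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sites, n_samples = 16 * 16, 200

# Crude stand-ins for Monte Carlo samples: mostly-aligned ("ordered") and
# random ("disordered") configurations of +/-1 spins.
ordered = np.sign(rng.normal(loc=2.0, size=(n_samples // 2, n_sites)))
disordered = rng.choice([-1, 1], size=(n_samples // 2, n_sites)).astype(float)
X = np.vstack([ordered, disordered])

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                     # projection of each configuration onto PC1

magnetization = X.mean(axis=1)       # per-configuration order parameter
corr = abs(np.corrcoef(pc1, magnetization)[0, 1])
print("correlation between PC1 and magnetization:", round(corr, 3))
```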

333 citations


Journal ArticleDOI
TL;DR: Results indicate that the proposed model has the potential to obtain a reliable classification of motor imagery EEG signals, and can thus be used as a practical system for controlling a wheelchair.

320 citations


Journal ArticleDOI
TL;DR: A novel dynamic PCA (DiPCA) algorithm is proposed to extract explicitly a set of dynamic latent variables with which to capture the most dynamic variations in the data.

297 citations


Journal ArticleDOI
TL;DR: A group of hypothesis tests are performed to show that combining the ANNs with the PCA gives slightly higher classification accuracy than the other two combinations, and that the trading strategies guided by the comprehensive classification mining procedures based on PCA and ANNs gain significantly higher risk-adjusted profits than the comparison benchmarks.
Abstract: A data mining procedure to forecast daily stock market return is proposed. The raw data includes 60 financial and economic features over a 10-year period. Combining ANNs with PCA gives slightly higher classification accuracy. Combining ANNs with PCA provides significantly higher risk-adjusted profits. In financial markets, it is both important and challenging to forecast the daily direction of the stock market return. Among the few studies that focus on predicting daily stock market returns, the data mining procedures utilized are either incomplete or inefficient, especially when a large number of features are involved. This paper presents a complete and efficient data mining process to forecast the daily direction of the S&P 500 Index ETF (SPY) return based on 60 financial and economic features. Three mature dimensionality reduction techniques, including principal component analysis (PCA), fuzzy robust principal component analysis (FRPCA), and kernel-based principal component analysis (KPCA), are applied to the whole data set to simplify and rearrange the original data structure. Corresponding to different levels of the dimensionality reduction, twelve new data sets are generated from the entire cleaned data using each of the three different dimensionality reduction methods. Artificial neural networks (ANNs) are then used with the thirty-six transformed data sets for classification to forecast the daily direction of future market returns. Moreover, the three different dimensionality reduction methods are compared with respect to the natural data set. A group of hypothesis tests are then performed over the classification and simulation results to show that combining the ANNs with the PCA gives slightly higher classification accuracy than the other two combinations, and that the trading strategies guided by the comprehensive classification mining procedures based on PCA and ANNs gain significantly higher risk-adjusted profits than the comparison benchmarks, while also being slightly higher than those strategies guided by the forecasts based on the FRPCA and KPCA models.
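
A hedged sketch of this kind of pipeline, using scikit-learn: standardize the features, reduce them with PCA at one of the dimensionality levels, and classify the daily direction with a small neural network. The synthetic 60-column matrix is a placeholder for the paper's financial and economic features, not the actual data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(2500, 60))          # placeholder for ~10 years of daily features
y = (X[:, :5].sum(axis=1) + rng.normal(size=2500) > 0).astype(int)   # up/down label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=15),                # one of several possible reduction levels
    MLPClassifier(hidden_layer_sizes=(30,), max_iter=500, random_state=0),
)
model.fit(X_tr, y_tr)
print("directional accuracy:", round(model.score(X_te, y_te), 3))
```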

272 citations


Journal ArticleDOI
TL;DR: It is demonstrated that quantified principal components from PCA not only allow the exploration of different phases and symmetry-breaking, but can also distinguish phase-transition types and locate critical points in frustrated models such as the triangular antiferromagnet.
Abstract: We apply unsupervised machine learning techniques, mainly principal component analysis (PCA), to compare and contrast the phase behavior and phase transitions in several classical spin models-the square- and triangular-lattice Ising models, the Blume-Capel model, a highly degenerate biquadratic-exchange spin-1 Ising (BSI) model, and the two-dimensional XY model-and we examine critically what machine learning is teaching us. We find that quantified principal components from PCA not only allow the exploration of different phases and symmetry-breaking, but they can distinguish phase-transition types and locate critical points. We show that the corresponding weight vectors have a clear physical interpretation, which is particularly interesting in the frustrated models such as the triangular antiferromagnet, where they can point to incipient orders. Unlike the other well-studied models, the properties of the BSI model are less well known. Using both PCA and conventional Monte Carlo analysis, we demonstrate that the BSI model shows an absence of phase transition and macroscopic ground-state degeneracy. The failure to capture the "charge" correlations (vorticity) in the BSI model (XY model) from raw spin configurations points to some of the limitations of PCA. Finally, we employ a nonlinear unsupervised machine learning procedure, the "autoencoder method," and we demonstrate that it too can be trained to capture phase transitions and critical points.

264 citations


Journal ArticleDOI
TL;DR: A novel hybrid model based on principal component analysis (PCA) and a least squares support vector machine (LSSVM) optimized by cuckoo search (CS) is proposed; it outperforms a single LSSVM model with default parameters and a general regression neural network (GRNN) model in PM2.5 concentration prediction.

193 citations


Journal ArticleDOI
TL;DR: A novel approach called joint sparse principal component analysis (JSPCA) is proposed to jointly select useful features and enhance robustness to outliers, and the experimental results demonstrate that the proposed approach is feasible and effective.

174 citations


Journal ArticleDOI
TL;DR: These results are a natural extension of those in Paul (2007) to a more general setting, solve the rates-of-convergence problems in Shen et al. (2013), and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases.
Abstract: We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size, and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size, and dimensionality play in principal component analysis. Our results are a natural extension of those in Paul (2007) to a more general setting and solve the rates of convergence problems in Shen et al. (2013). They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies.
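
The eigenvalue bias the abstract mentions is easy to see in a small simulation (ours, for illustration only): with dimensionality much larger than the sample size, the leading sample eigenvalue of a spiked covariance overshoots the population spike, which is the bias that estimators such as S-POET are designed to correct.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, spike = 100, 500, 10.0              # sample size, dimension, population spike

v = np.zeros(p)
v[0] = 1.0                                # spiked eigenvector
cov = np.eye(p) + (spike - 1.0) * np.outer(v, v)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

sample_cov = np.cov(X, rowvar=False)
top_sample_eig = np.linalg.eigvalsh(sample_cov)[-1]
print("population spike:", spike, "  leading sample eigenvalue:", round(top_sample_eig, 2))
# The upward gap illustrates the bias that such estimators aim to correct.
```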

158 citations


Journal ArticleDOI
TL;DR: The ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
Abstract: The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust with respect to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
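
A minimal sketch contrasting the two embeddings with scikit-learn; the synthetic clusters and sub-clusters stand in for continental and sub-continental population structure and are not the study's genotype data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
# Three coarse clusters ("continents"), each split into two sub-clusters
centers = rng.normal(scale=8.0, size=(3, 50))
X = np.vstack([
    c + rng.normal(scale=1.0, size=(100, 50)) + shift
    for c in centers
    for shift in (0.0, 1.5)              # small sub-continental offset
])

pca_coords = PCA(n_components=2).fit_transform(X)
tsne_coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(pca_coords.shape, tsne_coords.shape)   # both (600, 2), ready for plotting
```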

146 citations


Journal ArticleDOI
TL;DR: CoP is the first robust PCA algorithm that is simultaneously non-iterative, provably robust to both unstructured and structured outliers, and able to tolerate a large number of unstructured outliers.
Abstract: This paper presents a remarkably simple, yet powerful, algorithm termed coherence pursuit (CoP) to robust principal component analysis (PCA). As inliers lie in a low-dimensional subspace and are mostly correlated, an inlier is likely to have strong mutual coherence with a large number of data points. By contrast, outliers either do not admit low-dimensional structures or form small clusters. In either case, an outlier is unlikely to bear strong resemblance to a large number of data points. Given that, CoP sets an outlier apart from an inlier by comparing their coherence with the rest of the data points. The mutual coherences are computed by forming the Gram matrix of the normalized data points. Subsequently, the sought subspace is recovered from the span of the subset of the data points that exhibit strong coherence with the rest of the data. As CoP only involves one simple matrix multiplication, it is significantly faster than the state-of-the-art robust PCA algorithms. We derive analytical performance guarantees for CoP under different models for the distributions of inliers and outliers in both noise-free and noisy settings. CoP is the first robust PCA algorithm that is simultaneously non-iterative, provably robust to both unstructured and structured outliers, and can tolerate a large number of unstructured outliers.
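
A compact sketch of the coherence-pursuit idea as we read it from the abstract (not the authors' reference implementation): normalize the columns, form the Gram matrix of mutual coherences, score each point by its total coherence with the rest, and recover the subspace from the span of the highest-scoring points.

```python
import numpy as np

rng = np.random.default_rng(5)
d, r, n_in, n_out = 50, 3, 200, 40
basis = np.linalg.qr(rng.normal(size=(d, r)))[0]       # true low-dimensional subspace
inliers = basis @ rng.normal(size=(r, n_in))
outliers = rng.normal(size=(d, n_out))                 # unstructured outliers
X = np.hstack([inliers, outliers])

Xn = X / np.linalg.norm(X, axis=0, keepdims=True)      # normalize the data points
G = Xn.T @ Xn                                          # Gram matrix of mutual coherences
np.fill_diagonal(G, 0.0)
scores = np.sum(G ** 2, axis=1)                        # coherence score per point

keep = np.argsort(scores)[-n_in // 2:]                 # points most coherent with the rest
U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
estimate = U[:, :r]                                    # recovered subspace basis

# Cosines of principal angles between recovered and true subspaces (1.0 = aligned)
cosines = np.linalg.svd(estimate.T @ basis, compute_uv=False)
print(np.round(cosines, 3))
```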


Journal ArticleDOI
04 Aug 2017-Sensors
TL;DR: A multi-step trajectory clustering method is proposed that combines Dynamic Time Warping, a similarity measurement method, with Principal Component Analysis to decompose the obtained distance matrix, and an automatic algorithm for choosing the number of clusters k is developed according to the similarity distance.
Abstract: The Shipboard Automatic Identification System (AIS) is crucial for navigation safety and maritime surveillance; data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and mine customary route data for transportation safety. Thus, the capacities of navigation safety and maritime traffic monitoring could be enhanced correspondingly. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex compared with traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In particular, Dynamic Time Warping (DTW), a similarity measurement method, is introduced in the first step to measure the distances between different trajectories. The calculated distances, inversely proportional to the similarities, constitute a distance matrix in the second step. Furthermore, as a widely used dimensionality reduction method, Principal Component Analysis (PCA) is exploited to decompose the obtained distance matrix. In particular, the top k principal components with above 95% cumulative contribution rate are extracted by PCA, and the number of centers k is chosen. The k centers are found by the improved automatic center selection algorithm. In the last step, the improved center clustering algorithm with k clusters is implemented on the distance matrix to achieve the final AIS trajectory clustering results. In order to improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed according to the similarity distance. Numerous experiments on realistic AIS trajectory datasets in the bridge area waterway and Mississippi River have been implemented to compare our proposed method with traditional spectral clustering and fast affinity propagation clustering. Experimental results have illustrated its superior performance in terms of quantitative and qualitative evaluations.
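
The first steps of this pipeline are easy to sketch (our own toy version, with synthetic 1-D series standing in for AIS trajectories): a basic dynamic-programming DTW distance, a pairwise distance matrix, and an SVD-based PCA of that matrix that keeps the components covering at least 95% of the cumulative contribution.

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance for 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(6)
# Synthetic variable-length 1-D "trajectories" standing in for AIS tracks
trajectories = [np.cumsum(rng.normal(size=int(rng.integers(40, 60)))) for _ in range(20)]

n = len(trajectories)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(trajectories[i], trajectories[j])

# PCA/SVD on the (centered) distance matrix; keep >= 95% cumulative contribution
centered = dist - dist.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
cum = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(cum, 0.95)) + 1
print("principal components kept:", k)
```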

Journal ArticleDOI
TL;DR: This study applies the PCA technique in the Shakkar River Catchment to reduce redundancy among morphometric parameters, finds the most effective parameters for prioritization of the watershed, and compares the present prioritization scheme with that of Gajbhiye et al. (Appl Water Sci 4(1):51–61, 2013b).
Abstract: Remote sensing (RS) and Geographic Information Systems (GIS) techniques have become very important these days as they aid planners and decision makers in making effective and correct decisions and designs. Principal Component Analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables. It reduces the dimensionality of the data set and identifies new meaningful underlying variables. Morphometric analysis and prioritization of the sub-watersheds of the Shakkar River Catchment, Narsinghpur district in Madhya Pradesh State, India, is carried out using RS and GIS techniques as discussed in Gajbhiye et al. (Appl Water Sci 4(1):51–61, 2013b). In this study we apply the PCA technique in the Shakkar River Catchment to reduce redundancy among the morphometric parameters, find the most effective parameters for prioritization of the watershed, and compare the present prioritization scheme with that of Gajbhiye et al. (Appl Water Sci 4(1):51–61, 2013b).

Journal ArticleDOI
TL;DR: Compared with existing multivariate statistical process monitoring approaches such as principal component analysis (PCA) and its variants, the superior detectability of RTCSA is illustrated by a numerical example and the Tennessee Eastman process.

Journal ArticleDOI
TL;DR: In this article, the covariance matrix of a large portfolio of US equities is shown to be well represented by a low-rank common structure with a sparse residual matrix, and the proposed estimator largely outperforms the sample covariance estimator.

Journal ArticleDOI
TL;DR: A method to select the number of non-zero loadings in each PC while using SPCA is introduced, which considerably improves the interpretability of PCs while minimizing the loss of total variance explained.

Journal ArticleDOI
TL;DR: The EM-PCA- and PPCA-assisted K-Means algorithms accomplish the best clustering performance in the majority of cases and achieve significant results with both clustering algorithms for all sizes of T1w MRI images.

Journal ArticleDOI
Ce Wang, Hui Zhai
TL;DR: This work feeds the computer data generated by classical Monte Carlo simulation of the XY model in frustrated triangular and union jack lattices, which has two order parameters and exhibits two phase transitions, and shows that the outputs of the principal component analysis agree very well with the understanding of different orders in different phases.
Abstract: This work aims at determining whether artificial intelligence can recognize a phase transition without prior human knowledge. If this were successful, it could be applied to, for instance, analyzing data from the quantum simulation of unsolved physical models. Toward this goal, we first need to apply the machine learning algorithm to well-understood models and see whether the outputs are consistent with our prior knowledge, which serves as the benchmark for this approach. In this work, we feed the computer data generated by the classical Monte Carlo simulation for the $XY$ model in frustrated triangular and union jack lattices, which has two order parameters and exhibits two phase transitions. We show that the outputs of the principal component analysis agree very well with our understanding of different orders in different phases, and the temperature dependences of the major components detect the nature and the locations of the phase transitions. Our work offers promise for using machine learning techniques to study sophisticated statistical models, and our results can be further improved by using principal component analysis with kernel tricks and the neural network method.

Journal ArticleDOI
TL;DR: In this article, independent components are estimated by combining a nonparametric probability integral transformation with a generalized non-parametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components.
Abstract: This article introduces a novel statistical framework for independent component analysis (ICA) of multivariate data. We propose methodology for estimating mutually independent components, and a versatile resampling-based procedure for inference, including misspecification testing. Independent components are estimated by combining a nonparametric probability integral transformation with a generalized nonparametric whitening method based on distance covariance that simultaneously minimizes all forms of dependence among the components. We prove the consistency of our estimator under minimal regularity conditions and detail conditions for consistency under model misspecification, all while placing assumptions on the observations directly, not on the latent components. U statistics of certain Euclidean distances between sample elements are combined to construct a test statistic for mutually independent components. The proposed measures and tests are based on both necessary and sufficient conditions for...
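
The quantity at the heart of the whitening step is the (squared) sample distance covariance; a standard textbook implementation is sketched below for two univariate samples (this is the definition only, not the authors' full estimation or testing procedure).

```python
import numpy as np

def distance_covariance_sq(x, y):
    """Squared sample distance covariance between two 1-D samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()   # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()        # population value is zero iff x and y are independent

rng = np.random.default_rng(7)
x = rng.normal(size=500)
print("independent:", round(distance_covariance_sq(x, rng.normal(size=500)), 4))
print("dependent (y = x**2):", round(distance_covariance_sq(x, x ** 2), 4))
```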

Journal ArticleDOI
TL;DR: This paper proposes a non-greedy iterative algorithm to solve the trace ratio form of L1-norm-based linear discriminant analysis and demonstrates that the proposed algorithm can maximize the objective function value and is superior to most existing L1-LDA algorithms.
Abstract: Recently, L1-norm-based discriminant subspace learning has attracted much more attention in dimensionality reduction and machine learning. However, most existing approaches solve the column vectors of the optimal projection matrix one by one with a greedy strategy. Thus, the obtained optimal projection matrix does not necessarily best optimize the corresponding trace ratio objective function, which is the essential criterion function for general supervised dimensionality reduction. In this paper, we propose a non-greedy iterative algorithm to solve the trace ratio form of L1-norm-based linear discriminant analysis. We analyze the convergence of our proposed algorithm in detail. Extensive experiments on five popular image databases illustrate that our proposed algorithm can maximize the objective function value and is superior to most existing L1-LDA algorithms.
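
For orientation, the sketch below evaluates the standard (L2) trace-ratio criterion Tr(WᵀSbW) / Tr(WᵀSwW) for a given projection W; the paper's contribution, a non-greedy solver for the L1-norm version of this objective, is not reproduced here.

```python
import numpy as np

def trace_ratio(W, X, y):
    """Tr(W' Sb W) / Tr(W' Sw W) for projection W, data X, labels y."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T                            # between-class scatter
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))  # within-class scatter
    return np.trace(W.T @ Sb @ W) / np.trace(W.T @ Sw @ W)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(loc=m, size=(50, 10)) for m in (0.0, 2.0)])
y = np.repeat([0, 1], 50)
W = np.linalg.qr(rng.normal(size=(10, 2)))[0]        # a random orthonormal projection
print("trace ratio of a random projection:", round(trace_ratio(W, X, y), 3))
```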

Journal ArticleDOI
TL;DR: By taking a Bayesian probabilistic perspective, this work provides a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning.
Abstract: Deep learning is a form of machine learning for nonlinear high-dimensional pattern matching and prediction. By taking a Bayesian probabilistic perspective, we provide a number of insights into more efficient algorithms for optimisation and hyper-parameter tuning. Traditional high-dimensional data reduction techniques, such as principal component analysis (PCA), partial least squares (PLS), reduced rank regression (RRR), and projection pursuit regression (PPR), are all shown to be shallow learners. Their deep learning counterparts exploit multiple deep layers of data reduction which provide predictive performance gains. Stochastic gradient descent (SGD) training optimisation and Dropout (DO) regularization provide estimation and variable selection. Bayesian regularization is central to finding weights and connections in networks to optimize the predictive bias-variance trade-off. To illustrate our methodology, we provide an analysis of international bookings on Airbnb. Finally, we conclude with directions for future research.

Journal ArticleDOI
TL;DR: The proposed approach uses a shorter computational alternative to estimate the covariance matrix and Singular Value Decomposition to obtain the result of Principal Component Thermography (PCT), and ultimately segments the defects in the specimens by applying a color-based K-medoids clustering approach.

Journal ArticleDOI
TL;DR: A modified CCA method with regularization is developed to extract the correlation between process variables and quality variables, and a new concurrent CCA (CCCA) modeling method with regularization is proposed to exploit the variance and covariance in the process-specific and quality-specific spaces.

Journal ArticleDOI
TL;DR: A novel ensemble approach is proposed, namely rotation random forest via kernel principal component analysis (RoRF-KPCA), in which the original feature space is first randomly split into several subsets and KPCA is performed on each subset to extract high-order statistics.
Abstract: Random Forest (RF) is a widely used classifier that shows good performance in hyperspectral data classification. However, such performance could be improved by increasing the diversity that characterizes the ensemble architecture. In this paper, we propose a novel ensemble approach, namely rotation random forest via kernel principal component analysis (RoRF-KPCA). In particular, the original feature space is first randomly split into several subsets, and KPCA is performed on each subset to extract high-order statistics. The obtained feature sets are merged and used as input to an RF classifier. Finally, the results achieved at each step are fused by a majority vote. Experimental analysis is conducted using real hyperspectral remote sensing images to evaluate the performance of the proposed method in comparison with RF, rotation forest, support vector machines, and RoRF-PCA. The obtained results demonstrate the effectiveness of the proposed method.
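
A hedged sketch of one "rotation" of the recipe described above, using scikit-learn: split the features at random, apply KPCA to each subset, concatenate the extracted components, and feed them to a random forest. The toy classification data replace the hyperspectral imagery, and the majority-vote fusion over multiple rotations is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=40, n_informative=15,
                           random_state=0)                # stand-in for pixel spectra
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(9)
subsets = np.array_split(rng.permutation(X.shape[1]), 4)  # random feature split

# One KPCA per feature subset; concatenate the extracted components
kpcas = [KernelPCA(n_components=5, kernel="rbf").fit(X_tr[:, idx]) for idx in subsets]

def rotate(X_):
    return np.hstack([k.transform(X_[:, idx]) for k, idx in zip(kpcas, subsets)])

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(rotate(X_tr), y_tr)
print("test accuracy:", round(rf.score(rotate(X_te), y_te), 3))
```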

Journal ArticleDOI
TL;DR: In this paper, self-written software that applies Principal Component Analysis (PCA) to the measured spectrum is presented, to verify the possibility of objective auto-characterization of nanoparticles from their vibrational modes.


Journal Article
TL;DR: PCA depends upon the eigen-decomposition of positive semi-definite matrices and upon the singular value decomposition (SVD) of rectangular matrices, and is determined by eigenvectors and eigenvalues.
Abstract: Principal component analysis (PCA) is a multivariate technique that analyzes a data table in which observations are described by several inter-correlated quantitative dependent variables. Its goal is to extract the important information from the statistical data to represent it as a set of new orthogonal variables called principal components, and to display the pattern of similarity between the observations and of the variables as points in spot maps. Mathematically, PCA depends upon the eigen-decomposition of positive semi-definite matrices and upon the singular value decomposition (SVD) of rectangular matrices. It is determined by eigenvectors and eigenvalues. Eigenvectors and eigenvalues are numbers and vectors associated to square matrices. Together they provide the eigen-decomposition of a matrix, which analyzes the structure of this matrix. Even though the eigen-decomposition does not exist for all square matrices, it has a particularly simple expression for matrices such as correlation, covariance, or cross-product matrices.
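
The stated link between the eigen-decomposition and the SVD is easy to verify numerically (a small check of our own, not from the paper): the right singular vectors of the centered data matrix coincide with the eigenvectors of its covariance matrix, and the eigenvalues equal the squared singular values divided by n − 1.

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                                        # centered data matrix

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)              # SVD of the data
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))    # eigen-decomposition of covariance
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]             # sort in descending order

print(np.allclose(s ** 2 / (Xc.shape[0] - 1), eigvals))        # True
print(np.allclose(np.abs(Vt), np.abs(eigvecs.T)))              # True (up to sign)
```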

Journal ArticleDOI
TL;DR: In this paper, a methodology that combines principal component analysis (PCA) with an adaptive neuro-fuzzy inference system (ANFIS) is proposed to model the nonlinear relationship between ground surface settlements induced by an earth pressure balanced TBM and the operational and geological parameters.

Journal ArticleDOI
TL;DR: In this article, the authors proposed distribution-based methods with exact type 1 error controls for hypothesis testing and construction of confidence intervals for signals in a noisy matrix with finite samples, assuming Gaussian noise, by utilizing a post-selection inference framework, and extending the approach of Taylor, Loftus and Tibshirani (2013) in a PCA setting.
Abstract: Principal component analysis (PCA) is a well-known tool in multivariate statistics. One significant challenge in using PCA is the choice of the number of principal components. In order to address this challenge, we propose distribution-based methods with exact type 1 error controls for hypothesis testing and construction of confidence intervals for signals in a noisy matrix with finite samples. Assuming Gaussian noise, we derive exact type 1 error controls based on the conditional distribution of the singular values of a Gaussian matrix by utilizing a post-selection inference framework, and extending the approach of [Taylor, Loftus and Tibshirani (2013)] in a PCA setting. In simulation studies, we find that our proposed methods compare well to existing approaches.
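
As a rough illustration of the underlying testing problem (not the paper's exact, conditional test), one can compare the top singular value of a signal-plus-noise matrix against a Monte Carlo null distribution built from pure Gaussian noise matrices; the paper's methods replace this naive comparison with exact post-selection inference.

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, n_null = 50, 40, 2000

# Null distribution: top singular value of pure Gaussian noise matrices
null_top = np.array([np.linalg.svd(rng.normal(size=(n, p)), compute_uv=False)[0]
                     for _ in range(n_null)])

# Observed matrix: a rank-one signal of strength 20 buried in Gaussian noise
u, v = rng.normal(size=n), rng.normal(size=p)
signal = 20.0 * np.outer(u / np.linalg.norm(u), v / np.linalg.norm(v))
observed = np.linalg.svd(signal + rng.normal(size=(n, p)), compute_uv=False)[0]

p_value = np.mean(null_top >= observed)                # crude Monte Carlo p-value
print("observed top singular value:", round(observed, 2), "p-value:", p_value)
```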