
Showing papers on "Principal component analysis published in 2015"


Journal Article
TL;DR: This survey and generic solver suggest that linear dimensionality reduction can move toward becoming a blackbox, objective-agnostic numerical technology.
Abstract: Linear dimensionality reduction methods are a cornerstone of analyzing high dimensional data, due to their simple geometric interpretations and typically attractive computational properties. These methods capture many data features of interest, such as covariance, dynamical structure, correlation between data sets, input-output relationships, and margin between data classes. Methods have been developed with a variety of names and motivations in many fields, and perhaps as a result the connections between all these methods have not been highlighted. Here we survey methods from this disparate literature as optimization programs over matrix manifolds. We discuss principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlations analysis, maximum autocorrelation factors, slow feature analysis, sufficient dimensionality reduction, undercomplete independent component analysis, linear regression, distance metric learning, and more. This optimization framework gives insight to some rarely discussed shortcomings of well-known methods, such as the suboptimality of certain eigenvector solutions. Modern techniques for optimization over matrix manifolds enable a generic linear dimensionality reduction solver, which accepts as input data and an objective to be optimized, and returns, as output, an optimal low-dimensional projection of the data. This simple optimization framework further allows straightforward generalizations and novel variants of classical methods, which we demonstrate here by creating an orthogonal-projection canonical correlations analysis. More broadly, this survey and generic solver suggest that linear dimensionality reduction can move toward becoming a blackbox, objective-agnostic numerical technology.

430 citations
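
The survey's framing of linear dimensionality reduction as an optimization program over orthonormal projections can be illustrated in a few lines. The sketch below is my own illustration (not the authors' generic manifold solver): it solves the PCA instance, maximizing trace(M^T S M) over orthonormal M, via eigendecomposition of the sample covariance S.

```python
import numpy as np

def pca_projection(X, k):
    """Illustrative sketch: solve max trace(M^T S M) over orthonormal d x k matrices M,
    where S is the sample covariance; the top-k eigenvectors are the exact optimum."""
    Xc = X - X.mean(axis=0)                 # center the data
    S = np.cov(Xc, rowvar=False)            # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    M = eigvecs[:, ::-1][:, :k]             # top-k eigenvectors as the projection
    return Xc @ M                           # low-dimensional projection of the data
```

Other objectives in the survey replace the trace criterion but keep the same orthonormality constraint, which is what makes a single generic solver possible.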


Proceedings ArticleDOI
14 Jun 2015
TL;DR: In this paper, the authors show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error.
Abstract: We show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained k-rank approximation problems to within (1 + ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1 + ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9 + ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k / ε²) dimensions. This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.

314 citations
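
As a hedged illustration of the dimensionality-reduction idea (not the paper's own algorithms; the target dimension, constants, and function name below are my choices), one can Johnson-Lindenstrauss project the points to roughly O(log k / ε²) dimensions and run k-means in the sketch space.

```python
import numpy as np
from sklearn.cluster import KMeans

def jl_kmeans(A, k, eps=0.5, seed=0):
    """Sketch: project n points in A (n x d) to O(log k / eps^2) dimensions with a
    Gaussian random matrix, then run k-means in the reduced space."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = max(1, int(np.ceil(np.log(max(k, 2)) / eps ** 2)))   # reduced dimension
    R = rng.standard_normal((d, m)) / np.sqrt(m)              # JL projection matrix
    A_sketch = A @ R
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(A_sketch)
```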


Journal ArticleDOI
TL;DR: The problem of “selective inference” is described, which addresses the following challenge: Having mined a set of data to find potential associations, how do we assess the strength of these associations?
Abstract: We describe the problem of “selective inference.” This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have “cherry-picked”—searched for the strongest associations—means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

304 citations


Journal ArticleDOI
TL;DR: This protocol provides a user-friendly pipeline and graphical user interface for data pre-processing and unmixing of pixel spectra into their contributing pure components by multivariate curve resolution–alternating least squares (MCR-ALS) analysis.
Abstract: Raman and Fourier transform IR (FTIR) microspectroscopic images of biological material (tissue sections) contain detailed information about their chemical composition. The challenge lies in identifying changes in chemical composition, as well as locating and assigning these changes to different conditions (pathology, anatomy, environmental or genetic factors). Multivariate data analysis techniques are ideal for decrypting such information from the data. This protocol provides a user-friendly pipeline and graphical user interface (GUI) for data pre-processing and unmixing of pixel spectra into their contributing pure components by multivariate curve resolution-alternating least squares (MCR-ALS) analysis. The analysis considers the full spectral profile in order to identify the chemical compounds and to visualize their distribution across the sample to categorize chemically distinct areas. Results are rapidly achieved (usually <30-60 min per image), and they are easy to interpret and evaluate both in terms of chemistry and biology, making the method generally more powerful than principal component analysis (PCA) or heat maps of single-band intensities. In addition, chemical and biological evaluation of the results by means of reference matching and segmentation maps (based on k-means clustering) is possible.

236 citations
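
The MCR-ALS unmixing step at the heart of the protocol can be sketched in a few lines. This is a bare-bones illustration with a nonnegativity constraint only; the protocol's GUI, pre-processing, initial-estimate and convergence logic are not reproduced, and the function name is mine.

```python
import numpy as np

def mcr_als(D, n_components, n_iter=200, seed=0):
    """Sketch: unmix D (pixels x wavenumbers) as D ~ C @ S with nonnegative
    concentrations C and pure-component spectra S, by alternating least squares."""
    rng = np.random.default_rng(seed)
    S = np.abs(rng.standard_normal((n_components, D.shape[1])))   # crude initial spectra
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S), 0.0, None)   # update concentrations
        S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # update spectra
    return C, S
```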


Journal ArticleDOI
TL;DR: It is demonstrated precisely how using standard PCA can mislead inferences: the first few principal components of traits evolved under constant-rate multivariate Brownian motion will appear to have evolved via an "early burst" process.
Abstract: Most existing methods for modeling trait evolution are univariate, although researchers are often interested in investigating evolutionary patterns and processes across multiple traits. Principal components analysis (PCA) is commonly used to reduce the dimensionality of multivariate data so that univariate trait models can be fit to individual principal components. The problem with using standard PCA on phylogenetically structured data has been previously pointed out yet it continues to be widely used in the literature. Here we demonstrate precisely how using standard PCA can mislead inferences: The first few principal components of traits evolved under constant-rate multivariate Brownian motion will appear to have evolved via an "early burst" process. A phylogenetic PCA (pPCA) has been proposed to alleviate these issues. However, when the true model of trait evolution deviates from the model assumed in the calculation of the pPCA axes, we find that the use of pPCA suffers from similar artifacts as standard PCA. We show that data sets with high effective dimensionality are particularly likely to lead to erroneous inferences. Ultimately, all of the problems we report stem from the same underlying issue—by considering only the first few principal components as univariate traits, we are effectively examining a biased sample of a multivariate pattern. These results highlight the need for truly multivariate phylogenetic comparative methods. As these methods are still being developed, we discuss potential alternative strategies for using and interpreting models fit to univariate axes of multivariate data. (Brownian motion; early burst; multivariate evolution; Ornstein-Uhlenbeck; phylogenetic comparative methods; principal components analysis; quantitative genetics)

212 citations


Journal ArticleDOI
TL;DR: A new approach called the jackstraw is introduced that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs, and that can greatly simplify complex significance testing problems encountered in genomics.
Abstract: Motivation: There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. Results: We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. Availability and implementation: An R software package, called jackstraw, is available in CRAN. Contact: jstorey@princeton.edu

190 citations
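
The permutation scheme behind the jackstraw can be sketched as follows. This is a simplified illustration rather than the CRAN package: the association statistic, defaults, and names are my own, and real analyses should use the jackstraw R package.

```python
import numpy as np

def jackstraw_pvalues(Y, r=1, s=10, B=100, seed=0):
    """Sketch: Y is variables x samples. Permute a few variables at a time to
    build a null distribution of variable-PC association statistics."""
    rng = np.random.default_rng(seed)

    def top_pcs(M):
        Mc = M - M.mean(axis=1, keepdims=True)
        _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
        return Vt[:r]                                   # r orthonormal PCs (sample axis)

    def assoc(M, V):
        Mc = M - M.mean(axis=1, keepdims=True)
        fit = (Mc @ V.T) @ V                            # projection onto the PC span
        rss = ((Mc - fit) ** 2).sum(axis=1)
        tss = (Mc ** 2).sum(axis=1)
        return (tss - rss) / np.maximum(rss, 1e-12)     # F-like statistic per variable

    obs = assoc(Y, top_pcs(Y))
    null = []
    for _ in range(B):
        Yb = Y.copy()
        idx = rng.choice(Y.shape[0], size=s, replace=False)
        for i in idx:
            Yb[i] = rng.permutation(Yb[i])              # break the association for s variables
        null.append(assoc(Yb, top_pcs(Yb))[idx])
    null = np.concatenate(null)
    return np.array([(null >= o).mean() for o in obs])  # empirical p-values
```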


Journal ArticleDOI
TL;DR: In this article, a dynamic version of functional principal component analysis for time series of functional data (Xt : t ∈ Z) is proposed; inspired by Brillinger's theory of dynamic principal components, it is based on a frequency-domain approach.
Abstract: Summary We address the problem of dimension reduction for time series of functional data (Xt:t∈Z). Such functional time series frequently arise, for example, when a continuous time process is segmented into some smaller natural units, such as days. Then each Xt represents one intraday curve. We argue that functional principal component analysis, though a key technique in the field and a benchmark for any competitor, does not provide an adequate dimension reduction in a time series setting. Functional principal component analysis indeed is a static procedure which ignores the essential information that is provided by the serial dependence structure of the functional data under study. Therefore, inspired by Brillinger's theory of dynamic principal components, we propose a dynamic version of functional principal component analysis which is based on a frequency domain approach. By means of a simulation study and an empirical illustration, we show the considerable improvement that the dynamic approach entails when compared with the usual static procedure.

180 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider a sparse spiked covariance matrix model in the high-dimensional setting and study the minimax estimation of the covariance matrix and the principal subspace, as well as minimax rank detection.
Abstract: This paper considers a sparse spiked covariance matrix model in the high-dimensional setting and studies the minimax estimation of the covariance matrix and the principal subspace as well as the minimax rank detection. The optimal rate of convergence for estimating the spiked covariance matrix under the spectral norm is established, which requires significantly different techniques from those for estimating other structured covariance matrices such as bandable or sparse covariance matrices. We also establish the minimax rate under the spectral norm for estimating the principal subspace, the primary object of interest in principal component analysis. In addition, the optimal rate for the rank detection boundary is obtained. This result also resolves the gap in a recent paper by Berthet and Rigollet (Ann Stat 41(4):1780–1815, 2013) where the special case of rank one is considered.

158 citations


Journal ArticleDOI
TL;DR: In this article, a Component extrAction and sElection SAR algorithm is proposed to filter interferograms relevant to decorrelating scatterers, i.e., scatterers that may exhibit coherence losses depending on the spatial and temporal baseline distributions, and to detect and separate scattering mechanisms possibly interfering in the same pixel due to layover, directly at the interferogram generation stage.
Abstract: Synthetic aperture radar (SAR) tomography has been strongly developed in the last years for the analysis at fine scale of data acquired by high-resolution interferometric SAR sensors as a technique alternative to classical persistent scatterer interferometry and able to resolve also multiple scatterers. SqueeSAR is a recently proposed solution which, in the context of SAR interferometry at the coarse scale analysis stage, allows taking advantage of the multilook operation to filter interferometric stacks by extracting, pixel by pixel, equivalent scattering mechanisms from the set of all available interferometric measurements collected in the data covariance matrix. In this paper, we investigate the possibilities to extend SqueeSAR by allowing the identification of multiple scattering mechanisms from the analysis of the covariance matrix. In particular, we present a new approach, named “Component extrAction and sElection SAR” algorithm, that allows taking advantage of the principal component analysis to filter interferograms relevant to decorrelating scatterers, i.e., scatterers that may exhibit coherence losses depending on the spatial and temporal baseline distributions, and to detect and separate scattering mechanisms possibly interfering in the same pixel due to layover directly at the interferogram generation stage. The proposed module provides options useful for classical interferometric processing to monitor ground deformations at lower resolution (coarse scale), as well as for possibly aiding the data calibration preliminary to the subsequent full-resolution interferometric/tomographic (fine scale) analysis. Results achieved by processing high-resolution Cosmo-SkyMed data, characterized by the favorable features of a large baseline span, are presented to explain the advantages and validate this new interferometric processing solution.

152 citations


Journal ArticleDOI
TL;DR: Experimental results show that the improved sparse subspace clustering method has the second shortest computational time and also outperforms the other six methods in classification accuracy when using an appropriate band number obtained by the DC plot algorithm.
Abstract: An improved sparse subspace clustering (ISSC) method is proposed to select an appropriate band subset for hyperspectral imagery (HSI) classification. The ISSC assumes that band vectors are sampled from a union of low-dimensional orthogonal subspaces and each band can be sparsely represented as a linear or affine combination of other bands within its subspace. First, the ISSC represents band vectors with sparse coefficient vectors by solving the L2-norm optimization problem using the least square regression (LSR) algorithm. The sparse and block diagonal structure of the coefficient matrix from LSR leads to correct segmentation of band vectors. Second, the angular similarity measurement is presented and utilized to construct the similarity matrix. Third, the distribution compactness (DC) plot algorithm is used to estimate an appropriate size of the band subset. Finally, spectral clustering is implemented to segment the similarity matrix and the desired ISSC band subset is found. Four groups of experiments on three widely used HSI datasets are performed to test the performance of ISSC for selecting bands in classification. In addition, the following six state-of-the-art band selection methods are used to make comparisons: linear constrained minimum variance-based band correlation constraint (LCMV-BCC), affinity propagation (AP), spectral information divergence (SID), maximum-variance principal component analysis (MVPCA), sparse representation-based band selection (SpaBS), and sparse nonnegative matrix factorization (SNMF). Experimental results show that the ISSC has the second shortest computational time and also outperforms the other six methods in classification accuracy when using an appropriate band number obtained by the DC plot algorithm.

152 citations
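
A minimal sketch of the pipeline described above (closed-form LSR coefficients, angular similarity, then spectral clustering) follows. The regularization value, the per-cluster representative rule, and the function name are my own simplifications, and the DC-plot estimate of the band number is omitted.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def issc_band_selection(X, n_bands, lam=1e-2):
    """Sketch: X is pixels x bands; self-represent the bands with a closed-form
    LSR solve, build an angular similarity, cluster, and pick one band per cluster."""
    B = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(B), G)               # LSR coefficient matrix
    Cn = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-12)
    W = np.abs(Cn.T @ Cn)                                     # angular similarity matrix
    labels = SpectralClustering(n_clusters=n_bands, affinity="precomputed",
                                random_state=0).fit_predict(W)
    selected = []
    for c in range(n_bands):
        members = np.where(labels == c)[0]
        scores = W[np.ix_(members, members)].sum(axis=1)      # most central band wins
        selected.append(int(members[np.argmax(scores)]))
    return sorted(selected)
```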


Journal ArticleDOI
TL;DR: The case where some of the data values are missing is studied: a review of methods which accommodate PCA to missing data is proposed, and several techniques to consider or estimate (impute) missing values in PCA are presented.
Abstract: Principal component analysis (PCA) is a standard technique to summarize the main structures of a data table containing the measurements of several quantitative variables for a number of individuals. Here, we study the case where some of the data values are missing and propose a review of methods which accommodate PCA to missing data. In plant ecology, this statistical challenge relates to the current effort to compile global plant functional trait databases producing matrices with a large amount of missing values. We present several techniques to consider or estimate (impute) missing values in PCA and compare them using theoretical considerations. We carried out a simulation study to evaluate the relative merits of the different approaches in various situations (correlation structure, number of variables and individuals, and percentage of missing values) and also applied them on a real data set. Lastly, we discuss the advantages and drawbacks of these approaches, the potential pitfalls, and the challenges that need to be addressed in the future.
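
One member of this family, iterative PCA imputation, can be sketched as follows. This is an illustration only, not the specific regularized algorithms compared in the paper; without regularization this variant can overfit when many values are missing.

```python
import numpy as np

def iterative_pca_impute(X, n_components=2, n_iter=50):
    """Sketch: fill missing values with column means, then alternate between a
    rank-k PCA fit and re-imputation of the missing cells."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])          # initial fill
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        Xhat = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        X[miss] = Xhat[miss]                                 # re-impute from the rank-k fit
    return X
```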

Journal ArticleDOI
TL;DR: The PCA toolbox for MATLAB is described: a collection of modules for Principal Component Analysis, as well as for Cluster Analysis and Multidimensional Scaling, two other well-known multivariate methods for unsupervised data exploration.

Journal ArticleDOI
TL;DR: In this article, a 2D extension to singular spectrum analysis (2D-SSA), which is a recent technique for generic data mining and temporal signal analysis, is proposed for effective spatial information extraction.
Abstract: Feature extraction is of high importance for effective data classification in hyperspectral imaging (HSI). Considering the high correlation among band images, spectral-domain feature extraction is widely employed. For effective spatial information extraction, a 2-D extension to singular spectrum analysis (2D-SSA), which is a recent technique for generic data mining and temporal signal analysis, is proposed. With 2D-SSA applied to HSI, each band image is decomposed into varying trends, oscillations, and noise. Using the trend and the selected oscillations as features, the reconstructed signal, with noise highly suppressed, becomes more robust and effective for data classification. Three publicly available data sets for HSI remote sensing data classification are used in our experiments. Comprehensive results using a support vector machine classifier have quantitatively evaluated the efficacy of the proposed approach. Benchmarked with several state-of-the-art methods including 2-D empirical mode decomposition (2D-EMD), it is found that our proposed 2D-SSA approach generates the best results in most cases. Unlike 2D-EMD that requires sequential transforms to obtain detailed decomposition, 2D-SSA extracts all components simultaneously. As a result, the execution time in feature extraction can be also dramatically reduced. The superiority in terms of enhanced discrimination ability from 2D-SSA is further validated when a relatively weak classifier, i.e., the k-nearest neighbor, is used for data classification. In addition, the combination of 2D-SSA with 1-D principal component analysis (2D-SSA-PCA) has generated the best results among several other approaches, demonstrating the great potential in combining 2D-SSA with other approaches for effective spatial-spectral feature extraction and dimension reduction in HSI.
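
The embed-decompose-reconstruct cycle of 2D-SSA can be sketched with overlapping patches standing in for the trajectory matrix. This is an illustration only, not the authors' exact construction; the window size and rank are arbitrary, and the full SVD over all patches becomes expensive for large images.

```python
import numpy as np

def ssa2d_lowrank(img, win=5, rank=3):
    """Sketch: embed all overlapping win x win patches as rows of a trajectory
    matrix, keep the leading SVD components, and rebuild the image by averaging
    the reconstructed, overlapping patches."""
    H, W = img.shape
    patches = [img[i:i + win, j:j + win].ravel()
               for i in range(H - win + 1) for j in range(W - win + 1)]
    T = np.array(patches, dtype=float)
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Tr = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # rank-limited reconstruction
    out = np.zeros_like(img, dtype=float)
    cnt = np.zeros_like(img, dtype=float)
    k = 0
    for i in range(H - win + 1):
        for j in range(W - win + 1):
            out[i:i + win, j:j + win] += Tr[k].reshape(win, win)
            cnt[i:i + win, j:j + win] += 1.0
            k += 1
    return out / cnt                                 # averaged overlapping reconstruction
```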

Journal ArticleDOI
TL;DR: This paper proposes a novel method for fusion of MS/HS and PAN images and of MS and HS images in the lower dimensional PC subspace, with substantially lower computational requirements and very high tolerance to noise in the observed data.
Abstract: In remote sensing, due to cost and complexity issues, multispectral (MS) and hyperspectral (HS) sensors have significantly lower spatial resolution than panchromatic (PAN) images. Recently, the problem of fusing coregistered MS and HS images has gained some attention. In this paper, we propose a novel method for fusion of MS/HS and PAN images and of MS and HS images. MS and, more so, HS images contain spectral redundancy, which makes the dimensionality reduction of the data via principal component (PC) analysis very effective. The fusion is performed in the lower dimensional PC subspace; thus, we only need to estimate the first few PCs, instead of every spectral reflectance band, and without compromising the spectral and spatial quality. The benefits of the approach are substantially lower computational requirements and very high tolerance to noise in the observed data. Examples are presented using WorldView 2 data and a simulated data set based on a real HS image, with and without added noise.
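
A hedged sketch of PC-substitution fusion for the MS + PAN case is given below; variable names and the simple mean/variance matching of the PAN band are mine, and the paper's trick of estimating only the first few PCs and its noise handling are not reproduced.

```python
import numpy as np

def pca_pansharpen(ms_up, pan):
    """Sketch of PC-substitution fusion: ms_up is H x W x B multispectral data
    already resampled to the PAN grid, pan is the H x W panchromatic band."""
    H, W, B = ms_up.shape
    X = ms_up.reshape(-1, B).astype(float)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = V[:, ::-1]                                   # descending eigenvalue order
    scores = Xc @ V
    # match PAN to the first PC's mean and spread, then substitute it
    p = pan.ravel().astype(float)
    p = (p - p.mean()) / (p.std() + 1e-12) * scores[:, 0].std() + scores[:, 0].mean()
    scores[:, 0] = p
    return (scores @ V.T + mu).reshape(H, W, B)      # back-project to the band space
```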

Journal ArticleDOI
TL;DR: Experimental results reveal that rotation forest ensembles are competitive with other strong supervised classification methods, such as support vector machines, and can improve the classification accuracies significantly, confirming the importance of spatial contextual information in hyperspectral spectral-spatial classification.
Abstract: In this paper, we propose a new spectral–spatial classification strategy to enhance the classification performances obtained on hyperspectral images by integrating rotation forests and Markov random fields (MRFs). First, rotation forests are performed to obtain the class probabilities based on spectral information. Rotation forests create diverse base learners using feature extraction and subset features. The feature set is randomly divided into several disjoint subsets; then, feature extraction is performed separately on each subset, and a new set of linear extracted features is obtained. The base learner is trained with this set. An ensemble of classifiers is constructed by repeating these steps several times. The weak classifier of hyperspectral data, classification and regression tree (CART), is selected as the base classifier because it is unstable, fast, and sensitive to rotations of the axes. In this case, small changes in the training data of CART lead to a large change in the results, generating high diversity within the ensemble. Four feature extraction methods, including principal component analysis (PCA), neighborhood preserving embedding (NPE), linear local tangent space alignment (LLTSA), and linearity preserving projection (LPP), are used in rotation forests. Second, spatial contextual information, which is modeled by an MRF prior, is used to refine the classification results obtained from the rotation forests by solving a maximum a posteriori problem using the α-expansion graph cuts optimization method. Experimental results, conducted on three hyperspectral data sets with different resolutions and different contexts, reveal that rotation forest ensembles are competitive with other strong supervised classification methods, such as support vector machines. Rotation forests with local feature extraction methods, including NPE, LLTSA, and LPP, can lead to higher classification accuracies than that achieved by PCA. With the help of MRF, the proposed algorithms can improve the classification accuracies significantly, confirming the importance of spatial contextual information in hyperspectral spectral–spatial classification.
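
The rotation-forest construction can be sketched with scikit-learn as below. This is a simplified illustration: only PCA is used as the per-subset feature extractor, class labels are assumed to be integer-coded, and the MRF refinement stage is not included.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_forest(X, y, n_trees=10, n_subsets=4, seed=0):
    """Sketch: per tree, split the features into disjoint subsets, fit PCA on a
    bootstrap sample for each subset, assemble a block-diagonal rotation matrix,
    and train a CART tree on the rotated data."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    ensemble = []
    for _ in range(n_trees):
        subsets = np.array_split(rng.permutation(d), n_subsets)
        R = np.zeros((d, d))
        for sub in subsets:
            boot = rng.choice(len(X), size=len(X), replace=True)
            pca = PCA().fit(X[np.ix_(boot, sub)])
            R[np.ix_(sub, sub)] = pca.components_.T           # per-subset rotation block
        tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
        ensemble.append((R, tree))
    return ensemble

def predict_rotation_forest(ensemble, X):
    """Majority vote over the rotated trees (labels assumed integer-coded)."""
    votes = np.array([tree.predict(X @ R) for R, tree in ensemble]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```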

Posted Content
26 Nov 2015
TL;DR: In this article, amplitude amplification together with the phase estimation algorithm is used to obtain the eigenvectors associated with the largest eigenvalues of an operator, which makes it possible to do principal component analysis on quantum computers.
Abstract: Principal component analysis is a multivariate statistical method frequently used in science and engineering to reduce the dimension of a problem or extract the most significant features from a dataset. In this paper, using a notion similar to quantum counting, we show how to apply amplitude amplification together with the phase estimation algorithm to an operator in order to procure the eigenvectors of the operator associated with the eigenvalues defined in the range [a, b], where a and b are real and 0 ≤ a ≤ b ≤ 1. This makes it possible to obtain a combination of the eigenvectors associated with the largest eigenvalues and so can be used to do principal component analysis on quantum computers.

Journal ArticleDOI
TL;DR: It is suggested that automatic quantifications can lead to shape spaces that are as meaningful as those based on observer landmarks, thereby presenting potential to save time in data collection, increase completeness of morphological quantification, eliminate observer error, and allow comparisons of shape diversity between different types of bones.
Abstract: Three-dimensional geometric morphometric (3DGM) methods for placing landmarks on digitized bones have become increasingly sophisticated in the last 20 years, including greater degrees of automation. One aspect shared by all 3DGM methods is that the researcher must designate initial landmarks. Thus, researcher interpretations of homology and correspondence are required for and influence representations of shape. We present an algorithm allowing fully automatic placement of correspondence points on samples of 3D digital models representing bones of different individuals/species, which can then be input into standard 3DGM software and analyzed with dimension reduction techniques. We test this algorithm against several samples, primarily a dataset of 106 primate calcanei represented by 1,024 correspondence points per bone. Results of our automated analysis of these samples are compared to a published study using a traditional 3DGM approach with 27 landmarks on each bone. Data were analyzed with morphologika 2.5 and PAST. Our analyses returned strong correlations between principal component scores, similar variance partitioning among components, and similarities between the shape spaces generated by the automatic and traditional methods. While cluster analyses of both automatically generated and traditional datasets produced broadly similar patterns, there were also differences. Overall these results suggest to us that automatic quantifications can lead to shape spaces that are as meaningful as those based on observer landmarks, thereby presenting potential to save time in data collection, increase completeness of morphological quantification, eliminate observer error, and allow comparisons of shape diversity between different types of bones. We provide an R package for implementing this analysis. Anat Rec, 298:249–276, 2015. © 2014 Wiley Periodicals, Inc.

Journal ArticleDOI
TL;DR: Extensive Monte Carlo simulations demonstrate that the methods used in this paper have desirable finite-sample properties and outperform previous proposals.

Journal ArticleDOI
TL;DR: In the application, effects of age and BMI on the time‐specific change in probability of being active over a 24‐hour period are identified; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.
Abstract: This manuscript considers regression models for generalized, multilevel functional responses: functions are generalized in that they follow an exponential family distribution and multilevel in that they are clustered within groups or subjects. This data structure is increasingly common across scientific domains and is exemplified by our motivating example, in which binary curves indicating physical activity or inactivity are observed for nearly 600 subjects over 5 days. We use a generalized linear model to incorporate scalar covariates into the mean structure, and decompose subject-specific and subject-day-specific deviations using multilevel functional principal components analysis. Thus, functional fixed effects are estimated while accounting for within-function and within-subject correlations, and major directions of variability within and between subjects are identified. Fixed effect coefficient functions and principal component basis functions are estimated using penalized splines; model parameters are estimated in a Bayesian framework using Stan, a programming language that implements a Hamiltonian Monte Carlo sampler. Simulations designed to mimic the application have good estimation and inferential properties with reasonable computation times for moderate datasets, in both cross-sectional and multilevel scenarios; code is publicly available. In the application we identify effects of age and BMI on the time-specific change in probability of being active over a 24-hour period; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.

Proceedings Article
06 Jul 2015
TL;DR: The VR-PCA algorithm as discussed by the authors uses computationally cheap stochastic iterations, yet converges exponentially fast to the optimal solution, in contrast to existing algorithms that suffer either from slow convergence, or computationally intensive iterations whose runtime scales with the data size.
Abstract: We describe and analyze a simple algorithm for principal component analysis and singular value decomposition, VR-PCA, which uses computationally cheap stochastic iterations, yet converges exponentially fast to the optimal solution. In contrast, existing algorithms suffer either from slow convergence, or computationally intensive iterations whose runtime scales with the data size. The algorithm builds on a recent variance-reduced stochastic gradient technique, which was previously analyzed for strongly convex optimization, whereas here we apply it to an inherently non-convex problem, using a very different analysis.
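
The update analyzed in the paper can be sketched for the leading component as follows; this is a hedged re-implementation from the description above, the step size and epoch count are arbitrary, and X is assumed to be centered.

```python
import numpy as np

def vr_pca(X, eta=1e-3, epochs=20, seed=0):
    """Sketch of a VR-PCA-style iteration for the leading principal component of
    a centered data matrix X (n samples x d features)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        w_snap = w.copy()
        u_snap = X.T @ (X @ w_snap) / n              # full-gradient direction at the snapshot
        for _ in range(n):
            x = X[rng.integers(n)]
            # variance-reduced stochastic update, followed by renormalization
            w = w + eta * (x * (x @ w) - x * (x @ w_snap) + u_snap)
            w /= np.linalg.norm(w)
    return w
```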

Journal ArticleDOI
TL;DR: Comparing the proposed model with the traditional backpropagation neural network (BPNN), PCA-BPNN, and STNN, the empirical analysis shows that the forecasting results of the proposed neural network display a better performance in financial time series forecasting.

Posted Content
TL;DR: In this paper, the authors developed the necessary methodology to conduct principal component analysis at high frequency and constructed estimators of realized eigenvalues, eigenvectors, and principal components and provided the asymptotic distribution of these estimators.
Abstract: We develop the necessary methodology to conduct principal component analysis at high frequency. We construct estimators of realized eigenvalues, eigenvectors, and principal components and provide the asymptotic distribution of these estimators. Empirically, we study the high frequency covariance structure of the constituents of the S&P 100 Index using as little as one week of high frequency data at a time. The explanatory power of the high frequency principal components varies over time. During the recent financial crisis, the first principal component becomes increasingly dominant, explaining up to 60% of the variation on its own, while the second principal component drives the common variation of financial sector stocks.
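
At its simplest, the empirical exercise amounts to eigendecomposing a realized covariance matrix built from intraday returns. The sketch below ignores the asynchronous-trading and microstructure-noise corrections that a careful implementation needs, and the names are mine.

```python
import numpy as np

def realized_pca(returns):
    """Sketch: `returns` is (intraday intervals x assets); build the realized
    covariance over the window and report its principal components."""
    rcov = returns.T @ returns                      # realized covariance estimate
    w, V = np.linalg.eigh(rcov)
    order = np.argsort(w)[::-1]
    w, V = w[order], V[:, order]
    explained = w / w.sum()                         # variation share per realized PC
    return w, V, explained
```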

Journal ArticleDOI
TL;DR: A new procedure, kurtosis-based Wavelet Filtering (kbWF), is introduced; it defines "outlier" wavelets as those that contribute to a large fourth moment of the coefficient distribution and eliminates them.

Journal ArticleDOI
TL;DR: It is demonstrated that investigating the cross-covariance and the cross-correlation matrix between sphered and original variables allows one to break the rotational invariance and to identify optimal whitening transformations.
Abstract: Whitening, or sphering, is a common preprocessing step in statistical analysis to transform random variables to orthogonality. However, due to rotational freedom there are infinitely many possible whitening procedures. Consequently, there is a diverse range of sphering methods in use, for example based on principal component analysis (PCA), Cholesky matrix decomposition and zero-phase component analysis (ZCA), among others. Here we provide an overview of the underlying theory and discuss five natural whitening procedures. Subsequently, we demonstrate that investigating the cross-covariance and the cross-correlation matrix between sphered and original variables allows one to break the rotational invariance and to identify optimal whitening transformations. As a result we recommend two particular approaches: ZCA-cor whitening to produce sphered variables that are maximally similar to the original variables, and PCA-cor whitening to obtain sphered variables that maximally compress the original variables.
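
The transformations compared in the paper can be sketched directly from their definitions. This is a compact illustration: only three of the five discussed procedures are shown, a full-rank covariance is assumed, and the function signature is mine.

```python
import numpy as np

def whiten(X, method="zca-cor"):
    """Sketch of three whitening transforms; X is n samples x d features and the
    whitened data have identity sample covariance up to sampling error."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    if method == "pca":
        w, V = np.linalg.eigh(S)
        W = np.diag(w ** -0.5) @ V.T                     # rotate, then scale
    elif method == "zca":
        w, V = np.linalg.eigh(S)
        W = V @ np.diag(w ** -0.5) @ V.T                 # symmetric inverse square root
    elif method == "zca-cor":
        sd = np.sqrt(np.diag(S))
        P = S / np.outer(sd, sd)                         # correlation matrix
        w, V = np.linalg.eigh(P)
        W = V @ np.diag(w ** -0.5) @ V.T @ np.diag(1.0 / sd)
    else:
        raise ValueError(method)
    return Xc @ W.T
```

Under this convention the whitening matrix acts on standardized or centered variables, which is what makes the cross-correlation comparison in the paper well defined.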

Posted Content
TL;DR: In this paper, the authors propose genome-wide scans of natural selection using principal component analysis (PCA) to detect genetic variants involved in local adaptation without any prior definition of populations, and show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components.
Abstract: To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis. We show that the common Fst index of genetic differentiation between populations can be viewed as a proportion of variance explained by the principal components. Considering the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we consider the 1000 Genomes data (phase 1) after removal of recently admixed individuals, resulting in 850 individuals coming from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 million, obtained with a low-coverage sequencing depth (3X). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and non-coding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). An additional analysis of European data shows that a genome scan based on PCA retrieves classical examples of local adaptation even when there are no well-defined populations. PCA-based statistics, implemented in the PCAdapt R package and the PCAdapt open-source software, retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing projects, especially when defining populations is difficult.
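
The scan statistic described above can be sketched as the squared correlation between each variant and the leading principal components. This is a toy illustration, not the PCAdapt implementation; missing genotypes and the software's Mahalanobis-style combination of loadings are ignored.

```python
import numpy as np

def pca_scan(G, k=2):
    """Sketch: G is individuals x SNPs with 0/1/2 genotypes; return, per SNP, the
    summed squared correlation with the top-k principal components."""
    Gc = G - G.mean(axis=0)
    U, _, _ = np.linalg.svd(Gc, full_matrices=False)
    pcs = U[:, :k]                                  # orthonormal individual scores
    stats = np.zeros(G.shape[1])
    for j in range(G.shape[1]):
        g = Gc[:, j]
        denom = (g ** 2).sum()
        if denom > 0:
            stats[j] = ((pcs.T @ g) ** 2).sum() / denom
    return stats                                    # large values flag candidate SNPs
```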

Journal ArticleDOI
TL;DR: It is concluded that using principal components analysis (PCA) to reduce the number of auxiliary variables is an effective and practical way to reap the benefits of the inclusive strategy in the presence of many possible auxiliary variables.
Abstract: To deal with missing data that arise due to participant nonresponse or attrition, methodologists have recommended an "inclusive" strategy where a large set of auxiliary variables are used to inform the missing data process. In practice, the set of possible auxiliary variables is often too large. We propose using principal components analysis (PCA) to reduce the number of possible auxiliary variables to a manageable number. A series of Monte Carlo simulations compared the performance of the inclusive strategy with eight auxiliary variables (inclusive approach) to the PCA strategy using just one principal component derived from the eight original variables (PCA approach). We examined the influence of four independent variables: magnitude of correlations, rate of missing data, missing data mechanism, and sample size on parameter bias, root mean squared error, and confidence interval coverage. Results indicate that the PCA approach results in unbiased parameter estimates and potentially more accuracy than the inclusive approach. We conclude that using the PCA strategy to reduce the number of auxiliary variables is an effective and practical way to reap the benefits of the inclusive strategy in the presence of many possible auxiliary variables.
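
In practice the PCA strategy amounts to something like the following sketch with scikit-learn. The names are mine, the auxiliary variables are assumed to be fully observed, and the resulting component would then be supplied to one's FIML or multiple-imputation routine as the auxiliary variable.

```python
import numpy as np
from sklearn.decomposition import PCA

def auxiliary_pc(aux, n_components=1):
    """Sketch: collapse a large set of (complete) auxiliary variables into a few
    principal components to carry into the missing-data model."""
    Z = (aux - aux.mean(axis=0)) / (aux.std(axis=0) + 1e-12)   # standardize first
    return PCA(n_components=n_components).fit_transform(Z)

# The returned score(s) would be added as auxiliary variable(s) in FIML or as
# extra predictors in the multiple-imputation model.
```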

Posted Content
TL;DR: A new model called "Robust PCA on Graphs" is introduced, which incorporates spectral graph regularization into the Robust PCA framework and outperforms 10 other state-of-the-art models in its clustering and low-rank recovery tasks.
Abstract: Principal Component Analysis (PCA) is the most widely used tool for linear dimensionality reduction and clustering. Still it is highly sensitive to outliers and does not scale well with respect to the number of data samples. Robust PCA solves the first issue with a sparse penalty term. The second issue can be handled with the matrix factorization model, which is however non-convex. Besides, PCA based clustering can also be enhanced by using a graph of data similarity. In this article, we introduce a new model called "Robust PCA on Graphs" which incorporates spectral graph regularization into the Robust PCA framework. Our proposed model benefits from 1) the robustness of principal components to occlusions and missing values, 2) enhanced low-rank recovery, 3) improved clustering property due to the graph smoothness assumption on the low-rank matrix, and 4) convexity of the resulting optimization problem. Extensive experiments on 8 benchmark, 3 video and 2 artificial datasets with corruptions clearly reveal that our model outperforms 10 other state-of-the-art models in its clustering and low-rank recovery tasks.
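
For orientation, here is a sketch of plain Robust PCA via an inexact augmented Lagrangian; the proposed model additionally penalizes the low-rank part with a graph-smoothness term, which is omitted here, and the parameter defaults are standard heuristics rather than the paper's.

```python
import numpy as np

def rpca(M, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Sketch of principal component pursuit: split M into low-rank L plus
    sparse S via singular value and soft thresholding (graph term omitted)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 1.25 / (np.linalg.norm(M, 2) + 1e-12)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        # low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: elementwise soft-thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Y = Y + mu * (M - L - S)                     # dual ascent on the constraint
        if np.linalg.norm(M - L - S) <= tol * (np.linalg.norm(M) + 1e-12):
            break
    return L, S
```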

Journal ArticleDOI
TL;DR: A comprehensive review of methods proposed to cope with the high-dimensional and time-dependent features of statistical process monitoring is given, together with a real-data example to help the reader draw connections between the methods and the behavior they display.
Abstract: High-dimensional and time-dependent data pose significant challenges to statistical process monitoring. Dynamic principal-component analysis, recursive principal-component analysis, and moving-wind...
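
A minimal sketch of the dynamic-PCA variant mentioned above (lag-augmented PCA with Hotelling T² and SPE statistics) follows; the recursive and moving-window updates discussed in the review are not shown, control limits are omitted, and the names are mine.

```python
import numpy as np

def dynamic_pca_monitor(X, lags=2, n_components=3):
    """Sketch: augment each observation with `lags` past observations, fit PCA on
    the augmented data, and return Hotelling T^2 and SPE (Q) statistics."""
    n, d = X.shape
    Xa = np.hstack([X[lags - k: n - k] for k in range(lags + 1)])   # lag-augmented matrix
    Z = (Xa - Xa.mean(axis=0)) / (Xa.std(axis=0) + 1e-12)           # autoscale
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T                                          # retained loadings
    lam = (s[:n_components] ** 2) / (len(Z) - 1)                     # retained variances
    T = Z @ P
    t2 = np.sum(T ** 2 / lam, axis=1)                                # Hotelling T^2
    spe = np.sum((Z - T @ P.T) ** 2, axis=1)                         # squared prediction error
    return t2, spe
```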

Journal ArticleDOI
TL;DR: A method for arrhythmia classification based on spectral correlation is proposed; it relies on a cyclostationary signal analysis approach, which explores hidden periodicities in the signal of interest and is thus able to detect hidden features.
Abstract: A method for arrhythmia classification based on spectral correlation is proposed. Statistical features for the spectral correlation coefficients were calculated. Features were subjected to principal component analysis and Fisher score. Raw spectral correlation data, PCA data and FS data were classified using SVM. The best performance is achieved using raw spectral correlation data. Cardiac disorders are one of the main causes leading to death. Therefore, they require continuous and efficient detection techniques. ECG is one of the main tools to diagnose cardiovascular disorders such as arrhythmias. Computer aided diagnosis (CAD) systems play a very important role in early detection and diagnosis of cardiac arrhythmias. In this work, we propose a CAD system for classifying five beat types including: normal (N), Premature Ventricular Contraction (PVC), Premature Atrial Contraction (APC), Left Bundle Branch Block (LBBB) and Right Bundle Branch Block (RBBB). The proposed system is based on a cyclostationary signal analysis approach, which explores hidden periodicities in the signal of interest and thus it is able to detect hidden features. In order to study the cyclostationarity properties of the signal, we utilized the spectral correlation as a nonlinear statistical transformation inspecting the periodicity of the correlation. Three experiments were investigated in our study; raw spectral correlation data were used in the first experiment, while the other two experiments utilized statistical features for the raw spectral data followed by principal component analysis (PCA) and Fisher score for feature space reduction purposes, respectively. As for the classification task, a support vector machine (SVM) with a linear kernel was employed for all experiments. The experimental results showed that the approach that uses the raw spectral correlation data is superior compared to several state-of-the-art methods. This approach achieved sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV) of 99.20%, 99.70%, 98.60%, 99.90% and 97.60%, respectively.
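
The second experiment's pipeline (statistical features of the spectral correlation, PCA, then a linear SVM) maps naturally onto a scikit-learn pipeline; below is a hedged sketch with an arbitrary component count, where `features_train` and `labels_train` are placeholders rather than the paper's data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Sketch: `features_*` would hold per-beat statistics of the spectral correlation
# coefficients and `labels_*` the five beat classes (N, PVC, APC, LBBB, RBBB).
clf = make_pipeline(StandardScaler(), PCA(n_components=20), SVC(kernel="linear"))
# clf.fit(features_train, labels_train)
# predictions = clf.predict(features_test)
```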

01 Jan 2015
TL;DR: It is shown that there is an effective sample size regime in which no randomised polynomial time algorithm can achieve the minimax optimal rate.
Abstract: In recent years, sparse principal component analysis has emerged as an extremely popular dimension reduction technique for high-dimensional data. The theoretical challenge, in the simplest case, is to estimate the leading eigenvector of a population covariance matrix under the assumption that this eigenvector is sparse. An impressive range of estimators have been proposed; some of these are fast to compute, while others are known to achieve the minimax optimal rate over certain Gaussian or sub-Gaussian classes. In this paper, we show that, under a widely-believed assumption from computational complexity theory, there is a fundamental trade-off between statistical and computational performance in this problem. More precisely, working with new, larger classes satisfying a restricted covariance concentration condition, we show that there is an effective sample size regime in which no randomised polynomial time algorithm can achieve the minimax optimal rate. We also study the theoretical performance of a (polynomial time) variant of the well-known semidefinite relaxation estimator, revealing a subtle interplay between statistical and computational efficiency.