
Showing papers on "Principal component analysis published in 2014"


Journal ArticleDOI
TL;DR: The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.
Abstract: Principal component analysis is one of the most important and powerful methods in chemometrics as well as in a wealth of other areas. This paper provides a description of how to understand, use, and interpret principal component analysis. The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.
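For illustration only, here is a minimal sketch of the workflow the paper describes (fit, scores, loadings, explained variance), using synthetic data and scikit-learn rather than anything from the paper itself:

```python
# Minimal PCA sketch on a synthetic "samples x wavelengths" matrix.
# Illustrative assumptions only; the data and shapes are made up.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))            # 50 samples, 200 spectral variables
X -= X.mean(axis=0)                       # mean-center, as is standard in chemometrics

pca = PCA(n_components=3)
scores = pca.fit_transform(X)             # sample scores (50 x 3)
loadings = pca.components_.T              # variable loadings (200 x 3)

print("explained variance ratio:", pca.explained_variance_ratio_)
```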

1,622 citations


OtherDOI
29 Sep 2014
TL;DR: Principal component analysis (PCA) as mentioned in this paper replaces the p original variables by a smaller number, q, of derived variables, the principal components, which are linear combinations of the original variables.
Abstract: When large multivariate datasets are analyzed, it is often desirable to reduce their dimensionality. Principal component analysis is one technique for doing this. It replaces the p original variables by a smaller number, q, of derived variables, the principal components, which are linear combinations of the original variables. Often, it is possible to retain most of the variability in the original variables with q very much smaller than p. Despite its apparent simplicity, principal component analysis has a number of subtleties, and it has many uses and extensions. A number of choices associated with the technique are briefly discussed, namely, covariance or correlation, how many components, and different normalization constraints, as well as confusion with factor analysis. Various uses and extensions are outlined. Keywords: dimension reduction; factor analysis; multivariate analysis; variance maximization
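One of the choices mentioned above, covariance versus correlation, amounts to whether each variable is scaled to unit variance before the decomposition. A hedged sketch on synthetic data (not taken from the chapter):

```python
# Covariance-based vs correlation-based PCA differ only in whether the
# variables are standardized first. Synthetic example with unequal scales.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) * np.array([1.0, 2.0, 5.0, 10.0, 0.5])

pca_cov = PCA(n_components=2).fit(X - X.mean(axis=0))                  # covariance PCA
pca_cor = PCA(n_components=2).fit(StandardScaler().fit_transform(X))   # correlation PCA

print(pca_cov.explained_variance_ratio_)   # dominated by the high-variance columns
print(pca_cor.explained_variance_ratio_)   # variance spread more evenly
```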

1,268 citations


Book
06 Jul 2014
TL;DR: In this article, the authors present a survey of the main principles and controversies in multivariate analysis, including nonlinear principal components analysis, nonlinear generalized canonical analysis, and nonlinear canonical correlation analysis.
Abstract: Conventions and Controversies in Multivariate Analysis. Coding of Categorical Data. Homogeneity Analysis. Nonlinear Principal Components Analysis. Nonlinear Generalized Canonical Analysis. Nonlinear Canonical Correlation Analysis. Asymmetric Treatment of Sets: Some Special Cases, Some Future Programs. Multidimensional Scaling and Correspondence Analysis. Models as Gauges for the Analysis of Binary Data. Reflections on Restrictions. Nonlinear Multivariate Analysis: Principles and Possibilities. The Study of Stability. The Proof of the Pudding. Appendices. References. Author Index. Subject Index.

853 citations


Journal ArticleDOI
TL;DR: A linear-time algorithm applicable to a large class of trait evolution models is developed for efficient likelihood calculations and parameter inference on very large trees; it solves the traditional computational burden associated with two key terms, namely the determinant of the phylogenetic covariance matrix V and quadratic products involving the inverse of V.
Abstract: We developed a linear-time algorithm applicable to a large class of trait evolution models, for efficient likelihood calculations and parameter inference on very large trees. Our algorithm solves the traditional computational burden associated with two key terms, namely the determinant of the phylogenetic covariance matrix V and quadratic products involving the inverse of V. Applications include Gaussian models such as Brownian motion-derived models like Pagel's lambda, kappa, delta, and the early-burst model; Ornstein-Uhlenbeck models to account for natural selection with possibly varying selection parameters along the tree; as well as non-Gaussian models such as phylogenetic logistic regression, phylogenetic Poisson regression, and phylogenetic generalized linear mixed models. Outside of phylogenetic regression, our algorithm also applies to phylogenetic principal component analysis, phylogenetic discriminant analysis or phylogenetic prediction. The computational gain opens up new avenues for complex models or extensive resampling procedures on very large trees. We identify the class of models that our algorithm can handle as all models whose covariance matrix has a 3-point structure. We further show that this structure uniquely identifies a rooted tree whose branch lengths parametrize the trait covariance matrix, which acts as a similarity matrix. The new algorithm is implemented in the R package phylolm, including functions for phylogenetic linear regression and phylogenetic logistic regression.

728 citations


Journal ArticleDOI
TL;DR: A deep learning network (DLN) is proposed to discover unknown feature correlation between input signals that is crucial for the learning task and provides better performance compared to SVM and naive Bayes classifiers.
Abstract: Automatic emotion recognition is one of the most challenging tasks. To detect emotion from nonstationary EEG signals, a sophisticated learning algorithm that can represent high-level abstraction is required. This study proposes the utilization of a deep learning network (DLN) to discover unknown feature correlation between input signals that is crucial for the learning task. The DLN is implemented with a stacked autoencoder (SAE) using hierarchical feature learning approach. Input features of the network are power spectral densities of 32-channel EEG signals from 32 subjects. To alleviate overfitting problem, principal component analysis (PCA) is applied to extract the most important components of initial input features. Furthermore, covariate shift adaptation of the principal components is implemented to minimize the nonstationary effect of EEG signals. Experimental results show that the DLN is capable of classifying three different levels of valence and arousal with accuracy of 49.52% and 46.03%, respectively. Principal component based covariate shift adaptation enhances the respective classification accuracy by 5.55% and 6.53%. Moreover, DLN provides better performance compared to SVM and naive Bayes classifiers.

432 citations


Journal ArticleDOI
TL;DR: Applications of the t test, analysis of variance, principal component analysis and partial least squares discriminant analysis will be shown on both real and simulated metabolomics data examples to provide an overview on fundamental aspects of univariate and multivariate methods.
Abstract: Metabolomics experiments usually result in a large quantity of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from the data with the aim of providing biological knowledge on the problem studied. Despite the fact that statistical tools like the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis constitute the backbone of the statistical part of the vast majority of metabolomics papers, it seems that many basic but rather fundamental questions are still often asked, like: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if you have already applied a multivariate method? Why if I do not see something univariately I see something multivariately? In the present paper we address some aspects of univariate and multivariate analysis, with the scope of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis and partial least squares discriminant analysis will be shown on both real and simulated metabolomics data examples to provide an overview on fundamental aspects of univariate and multivariate methods.
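To make the univariate versus multivariate contrast concrete, the following sketch runs a per-variable t test next to a PCA projection on the same synthetic two-group table; the data and effect sizes are invented, not from the paper:

```python
# Contrast a univariate test per variable with a multivariate projection
# on a synthetic two-group "metabolomics" table.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, size=(20, 30))
group_b = rng.normal(0.0, 1.0, size=(20, 30))
group_b[:, :5] += 0.8                      # only the first 5 metabolites shift

X = np.vstack([group_a, group_b])
t_stats, p_values = ttest_ind(group_a, group_b, axis=0)          # univariate view
scores = PCA(n_components=2).fit_transform(X - X.mean(axis=0))   # multivariate view

print("variables with p < 0.05:", np.where(p_values < 0.05)[0])
print("PC1 group means:", scores[:20, 0].mean(), scores[20:, 0].mean())
```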

405 citations


Posted Content
TL;DR: Linear dimensionality reduction methods have been developed with a variety of names and motivations in many fields, and perhaps as a result the connections between all these methods have not been highlighted as discussed by the authors.
Abstract: Linear dimensionality reduction methods are a cornerstone of analyzing high dimensional data, due to their simple geometric interpretations and typically attractive computational properties. These methods capture many data features of interest, such as covariance, dynamical structure, correlation between data sets, input-output relationships, and margin between data classes. Methods have been developed with a variety of names and motivations in many fields, and perhaps as a result the connections between all these methods have not been highlighted. Here we survey methods from this disparate literature as optimization programs over matrix manifolds. We discuss principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlations analysis, maximum autocorrelation factors, slow feature analysis, sufficient dimensionality reduction, undercomplete independent component analysis, linear regression, distance metric learning, and more. This optimization framework gives insight to some rarely discussed shortcomings of well-known methods, such as the suboptimality of certain eigenvector solutions. Modern techniques for optimization over matrix manifolds enable a generic linear dimensionality reduction solver, which accepts as input data and an objective to be optimized, and returns, as output, an optimal low-dimensional projection of the data. This simple optimization framework further allows straightforward generalizations and novel variants of classical methods, which we demonstrate here by creating an orthogonal-projection canonical correlations analysis. More broadly, this survey and generic solver suggest that linear dimensionality reduction can move toward becoming a blackbox, objective-agnostic numerical technology.
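The survey's framing of PCA as an optimization program can be illustrated with the classical closed form: maximize the projected variance trace(WᵀSW) over orthonormal W, which is solved by the top eigenvectors of the sample covariance. A small numpy sketch on synthetic data:

```python
# PCA as a variance-maximization program over orthonormal projections,
# solved by the top eigenvectors of the sample covariance S.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
Xc = X - X.mean(axis=0)

S = np.cov(Xc, rowvar=False)               # 10 x 10 sample covariance
eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :3]                # top-3 eigenvectors = optimal projection

Z = Xc @ W                                 # low-dimensional representation
print("captured variance:", eigvals[::-1][:3].sum() / eigvals.sum())
```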

313 citations


Journal ArticleDOI
TL;DR: A novel denoising algorithm for photon-limited images which combines elements of dictionary learning and sparse patch-based representations of images and reveals that, despite its conceptual simplicity, Poisson PCA-based denoising appears to be highly competitive in very low light regimes.
Abstract: Photon-limited imaging arises when the number of photons collected by a sensor array is small relative to the number of detector elements. Photon limitations are an important concern for many applications such as spectral imaging, night vision, nuclear medicine, and astronomy. Typically a Poisson distribution is used to model these observations, and the inherent heteroscedasticity of the data combined with standard noise removal methods yields significant artifacts. This paper introduces a novel denoising algorithm for photon-limited images which combines elements of dictionary learning and sparse patch-based representations of images. The method employs both an adaptation of Principal Component Analysis (PCA) for Poisson noise and recently developed sparsity-regularized convex optimization algorithms for photon-limited images. A comprehensive empirical evaluation of the proposed method helps characterize the performance of this approach relative to other state-of-the-art denoising methods. The results reveal that, despite its conceptual simplicity, Poisson PCA-based denoising appears to be highly competitive in very low light regimes.

289 citations


Journal ArticleDOI
09 Apr 2014-PLOS ONE
TL;DR: Flashpca as mentioned in this paper is a highly efficient implementation of principal component analysis (PCA) based on randomized algorithms which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time.
Abstract: Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
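In the spirit of flashpca (but not its code), randomized solvers approximate the top components without a full eigendecomposition; scikit-learn exposes the same idea. Simulated genotypes stand in for real SNP data:

```python
# Randomized PCA sketch on a simulated individuals-by-SNPs matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
genotypes = rng.integers(0, 3, size=(1000, 2000)).astype(float)  # individuals x SNPs
genotypes -= genotypes.mean(axis=0)                              # center each SNP

pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
pcs = pca.fit_transform(genotypes)        # top-10 PCs of the individuals
print(pcs.shape)                          # (1000, 10)
```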

284 citations


Journal ArticleDOI
TL;DR: The aim in this paper is to improve the general understanding of how PCA and DA process and display differential sensing data, which should lead to the ability to better interpret the final results.
Abstract: Statistical analysis techniques such as principal component analysis (PCA) and discriminant analysis (DA) have become an integral part of data analysis for differential sensing. These multivariate statistical tools, while extremely versatile and useful, are sometimes used as “black boxes”. Our aim in this paper is to improve the general understanding of how PCA and DA process and display differential sensing data, which should lead to the ability to better interpret the final results. With various sets of model data, we explore several topics, such as how to choose an appropriate number of hosts for an array, selectivity compared to cross-reactivity, when to add hosts, how to obtain the best visually representative plot of a data set, and when arrays are not necessary. We also include items at the end of the paper as general recommendations which readers can follow when using PCA or DA in a practical application. Through this paper we hope to present these statistical analysis methods in a manner such that chemists gain further insight into approaches that optimize the discriminatory power of their arrays.
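As a minimal sketch of the two tools discussed, the code below places unsupervised PCA next to supervised linear discriminant analysis on a synthetic "analytes x sensor responses" array; the array sizes and class structure are assumptions for illustration:

```python
# PCA (preserves variance) vs LDA (maximizes class separation) on a
# synthetic differential-sensing array.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
n_per_class, n_sensors = 15, 8
centers = rng.normal(scale=2.0, size=(4, n_sensors))        # 4 analytes
X = np.vstack([c + rng.normal(scale=0.5, size=(n_per_class, n_sensors))
               for c in centers])
y = np.repeat(np.arange(4), n_per_class)

pca_scores = PCA(n_components=2).fit_transform(X)
lda_scores = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(pca_scores.shape, lda_scores.shape)
```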

269 citations


Posted Content
TL;DR: This work extends the idea of PCA to handle arbitrary data sets consisting of numerical, Boolean, categorical, ordinal, and other data types, and proposes several parallel algorithms for fitting generalized low rank models.
Abstract: Principal components analysis (PCA) is a well-known technique for approximating a tabular data set by a low rank matrix. Here, we extend the idea of PCA to handle arbitrary data sets consisting of numerical, Boolean, categorical, ordinal, and other data types. This framework encompasses many well known techniques in data analysis, such as nonnegative matrix factorization, matrix completion, sparse and robust PCA, $k$-means, $k$-SVD, and maximum margin matrix factorization. The method handles heterogeneous data sets, and leads to coherent schemes for compressing, denoising, and imputing missing entries across all data types simultaneously. It also admits a number of interesting interpretations of the low rank factors, which allow clustering of examples or of features. We propose several parallel algorithms for fitting generalized low rank models, and describe implementations and numerical results.
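Only the quadratic-loss special case of a generalized low rank model, which reduces to PCA / truncated SVD, is sketched below; the heterogeneous per-column losses and parallel fitting described in the paper are not shown, and the data are synthetic:

```python
# Quadratic-loss low rank model: best rank-k approximation via truncated SVD.
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20)) \
    + 0.1 * rng.normal(size=(100, 20))                        # approximately rank 5
k = 5

U, s, Vt = np.linalg.svd(A - A.mean(axis=0), full_matrices=False)
X_factor = U[:, :k] * s[:k]               # "examples" factor (100 x k)
Y_factor = Vt[:k, :]                      # "features" factor (k x 20)

reconstruction = X_factor @ Y_factor + A.mean(axis=0)
print("relative error:", np.linalg.norm(A - reconstruction) / np.linalg.norm(A))
```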

Journal ArticleDOI
TL;DR: The first model-based clustering algorithm for multivariate functional data is proposed, based on the assumption of normality of the principal component scores, and it ability to take into account the dependence among curves.

Proceedings ArticleDOI
31 May 2014
TL;DR: It is shown that the well-known, but misnamed, randomized response algorithm provides a nearly optimal additive quality gap compared to the best possible singular subspace of A, and that when AᵀA has a large eigenvalue gap -- a reason often cited for PCA -- the quality improves significantly.
Abstract: We consider the problem of privately releasing a low dimensional approximation to a set of data records, represented as a matrix A in which each row corresponds to an individual and each column to an attribute. Our goal is to compute a subspace that captures the covariance of A as much as possible, classically known as principal component analysis (PCA). We assume that each row of A has ℓ2 norm bounded by one, and the privacy guarantee is defined with respect to addition or removal of any single row. We show that the well-known, but misnamed, randomized response algorithm, with properly tuned parameters, provides a nearly optimal additive quality gap compared to the best possible singular subspace of A. We further show that when AᵀA has a large eigenvalue gap -- a reason often cited for PCA -- the quality improves significantly. Optimality (up to logarithmic factors) is proved using techniques inspired by the recent work of Bun, Ullman, and Vadhan on applying Tardos's fingerprinting codes to the construction of hard instances for private mechanisms for 1-way marginal queries. Along the way we define a list culling game which may be of independent interest. By combining the randomized response mechanism with the well-known follow-the-perturbed-leader algorithm of Kalai and Vempala we obtain a private online algorithm with nearly optimal regret. The regret of our algorithm even outperforms all the previously known online non-private algorithms of this type. We achieve this better bound by, satisfyingly, borrowing insights and tools from differential privacy!
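The input-perturbation idea analyzed in the paper can be sketched as adding a symmetric Gaussian matrix to the Gram matrix AᵀA and taking the top eigenvectors of the result; the noise scale below is an arbitrary placeholder, not a value calibrated to any privacy budget:

```python
# Input-perturbation sketch of private PCA: perturb A^T A with symmetric
# Gaussian noise, then take the top eigenvectors of the noisy Gram matrix.
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(500, 20))
A /= np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1.0)   # rows have l2 norm <= 1

gram = A.T @ A
noise = rng.normal(scale=1.0, size=gram.shape)                   # arbitrary, uncalibrated scale
noise = (noise + noise.T) / 2.0                                  # keep the matrix symmetric

eigvals, eigvecs = np.linalg.eigh(gram + noise)
top_k_subspace = eigvecs[:, ::-1][:, :5]                         # noisy top-5 subspace
print(top_k_subspace.shape)
```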

Journal ArticleDOI
TL;DR: Comprehensive results have indicated that the proposed Folded-PCA approach not only outperforms the conventional PCA but also the baseline approach where the whole feature sets are used.
Abstract: As a widely used approach for feature extraction and data reduction, Principal Components Analysis (PCA) suffers from high computational cost, large memory requirement and low efficacy in dealing with large dimensional datasets such as Hyperspectral Imaging (HSI). Consequently, a novel Folded-PCA is proposed, where the spectral vector is folded into a matrix to allow the covariance matrix to be determined more efficiently. With this matrix-based representation, both global and local structures are extracted to provide additional information for data classification. Moreover, both the computational cost and the memory requirement have been significantly reduced. Using Support Vector Machine (SVM) for classification on two well-known HSI datasets and one Synthetic Aperture Radar (SAR) dataset in remote sensing, quantitative results are generated for objective evaluations. Comprehensive results have indicated that the proposed Folded-PCA approach not only outperforms the conventional PCA but also the baseline approach where the whole feature sets are used.
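A simplified reading of the folding idea is sketched below: each spectral vector is reshaped into a matrix and a small covariance matrix is accumulated from the folded blocks. The shapes, band count, and projection size are assumptions for illustration, not the paper's datasets or exact formulation:

```python
# Folded-PCA sketch: fold each length-d spectral vector into an h x w matrix
# and accumulate a w x w covariance instead of a d x d one.
import numpy as np

rng = np.random.default_rng(8)
n_pixels, h, w = 1000, 20, 10              # d = h * w = 200 spectral bands (assumed)
spectra = rng.normal(size=(n_pixels, h * w))
spectra -= spectra.mean(axis=0)

cov_folded = np.zeros((w, w))
for x in spectra:
    A = x.reshape(h, w)                    # fold the spectral vector
    cov_folded += A.T @ A
cov_folded /= n_pixels                     # w x w instead of (h*w) x (h*w)

eigvals, eigvecs = np.linalg.eigh(cov_folded)
proj = eigvecs[:, ::-1][:, :3]             # keep 3 folded components
features = np.stack([(x.reshape(h, w) @ proj).ravel() for x in spectra])
print(features.shape)                      # (1000, h * 3)
```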

Proceedings Article
21 Jun 2014
TL;DR: Both theoretical analysis and empirical studies demonstrate the proposed novel robust PCA objective functions with removing optimal mean automatically can more effectively reduce data dimensionality than previous robustPCA methods.
Abstract: Principal Component Analysis (PCA) is the most widely used unsupervised dimensionality reduction approach. In recent research, several robust PCA algorithms were presented to enhance the robustness of the PCA model. However, the existing robust PCA methods incorrectly center the data using the l2-norm distance to calculate the mean, which actually is not the optimal mean due to the l1-norm used in the objective functions. In this paper, we propose novel robust PCA objective functions that remove the optimal mean automatically. Both theoretical analysis and empirical studies demonstrate that our new methods can more effectively reduce data dimensionality than previous robust PCA methods.

Journal ArticleDOI
TL;DR: Experimental results revealed that Rotation Forest, especially with PCA transformation, could produce more accurate results than bagging, AdaBoost, and Random Forest, indicating that Rotation Forests are promising approaches for generating classifier ensembles for hyperspectral remote sensing.
Abstract: In this letter, an ensemble learning approach, Rotation Forest, has been applied to hyperspectral remote sensing image classification for the first time. The framework of Rotation Forest is to project the original data into a new feature space using transformation methods for each base classifier (decision tree), so that each base classifier is trained in a different space, encouraging both individual accuracy and diversity within the ensemble. Principal component analysis (PCA), maximum noise fraction, independent component analysis, and local Fisher discriminant analysis are introduced as feature transformation algorithms in the original Rotation Forest. The performance of Rotation Forest was evaluated based on several criteria: different data sets, sensitivity to the number of training samples, ensemble size, and the number of features in a subset. Experimental results revealed that Rotation Forest, especially with PCA transformation, could produce more accurate results than bagging, AdaBoost, and Random Forest. They indicate that Rotation Forests are promising approaches for generating classifier ensembles for hyperspectral remote sensing.
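A simplified Rotation Forest sketch with PCA rotations follows; the feature-subset count, bootstrap fraction, and ensemble size are arbitrary choices for illustration, not the letter's experimental setup:

```python
# Rotation Forest sketch: for each tree, split the features into subsets,
# run PCA on each subset (on a bootstrap sample), assemble a block-diagonal
# rotation, and train a decision tree in the rotated space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_tree(X, y, n_subsets=3, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    n_features = X.shape[1]
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    rotation = np.zeros((n_features, n_features))
    for idx in subsets:
        boot = rng.choice(X.shape[0], size=int(0.75 * X.shape[0]), replace=True)
        pca = PCA().fit(X[boot][:, idx])                  # keep all components
        rotation[np.ix_(idx, idx)] = pca.components_.T    # block on the diagonal
    tree = DecisionTreeClassifier(random_state=0).fit(X @ rotation, y)
    return rotation, tree

rng = np.random.default_rng(9)
X = rng.normal(size=(300, 12))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
ensemble = [fit_rotation_tree(X, y, rng=rng) for _ in range(10)]
votes = np.mean([t.predict(X @ R) for R, t in ensemble], axis=0)
print("training accuracy:", ((votes > 0.5) == y).mean())
```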

Journal ArticleDOI
TL;DR: The key operation of MSPCA is to rewrite the MPCA into multilinear regression forms and relax it for sparse regression, which has the potential to outperform the existing PCA-based subspace learning algorithms.
Abstract: In this brief, multilinear sparse principal component analysis (MSPCA) is proposed for feature extraction from tensor data. MSPCA can be viewed as a further extension of the classical principal component analysis (PCA), sparse PCA (SPCA) and the recently proposed multilinear PCA (MPCA). The key operation of MSPCA is to rewrite the MPCA into multilinear regression forms and relax it for sparse regression. Differing from the recently proposed MPCA, MSPCA inherits the sparsity from the SPCA and iteratively learns a series of sparse projections that capture most of the variation of the tensor data. Each nonzero element in the sparse projections is selected from the most important variables/factors using the elastic net. Extensive experiments on the Yale and Face Recognition Technology face databases and the COIL-20 object database, with the object images encoded as second-order tensors, and on the Weizmann action database, encoded as third-order tensors, demonstrate that the proposed MSPCA algorithm has the potential to outperform the existing PCA-based subspace learning algorithms.

Reference BookDOI
20 Nov 2014
TL;DR: A reference book covering principal component analysis, multiple correspondence analysis, factor analysis for mixed data, and multiple factor analysis, including methods for weighting and comparing groups of variables.
Abstract: Principal Component Analysis. Multiple Correspondence Analysis. Factor Analysis for Mixed Data. Weighting Groups of Variables. Comparing Clouds of Partial Individuals. Factors Common to Different Groups of Variables. Comparing Groups of Variables and the Indscal Model. Qualitative and Mixed Data. Multiple Factor Analysis and Procrustes Analysis. Hierarchical Multiple Factor Analysis. Matrix Calculus and Euclidean Vector Space. Bibliography.

Journal ArticleDOI
TL;DR: This work proposes to apply principal component analysis (PCA) for feature extraction prior to detecting changes in multidimensional unlabeled data and shows that feature extraction through PCA is beneficial, specifically for data with multiple balanced classes.
Abstract: When classifiers are deployed in real-world applications, it is assumed that the distribution of the incoming data matches the distribution of the data used to train the classifier. This assumption is often incorrect, which necessitates some form of change detection or adaptive classification. While there has been a lot of work on change detection based on the classification error monitored over the course of the operation of the classifier, finding changes in multidimensional unlabeled data is still a challenge. Here, we propose to apply principal component analysis (PCA) for feature extraction prior to the change detection. Supported by a theoretical example, we argue that the components with the lowest variance should be retained as the extracted features because they are more likely to be affected by a change. We chose a recently proposed semiparametric log-likelihood change detection criterion that is sensitive to changes in both mean and variance of the multidimensional distribution. An experiment with 35 datasets and an illustration with a simple video segmentation demonstrate the advantage of using extracted features compared to raw data. Further analysis shows that feature extraction through PCA is beneficial, specifically for data with multiple balanced classes.
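The paper's central point, keep the lowest-variance components because they react most to a change, can be sketched as follows. The detection criterion below is a simple variance ratio, not the semiparametric log-likelihood criterion used in the paper, and the data are synthetic:

```python
# Change-detection sketch: fit PCA on a reference window, keep the
# LOWEST-variance components, and monitor their variance on new windows.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
reference = rng.normal(size=(1000, 8)) @ rng.normal(size=(8, 8))

pca = PCA().fit(reference)
low_var = pca.components_[-3:]                        # 3 lowest-variance directions
ref_proj = (reference - pca.mean_) @ low_var.T
ref_var = ref_proj.var(axis=0)

drifted = reference[:200] + rng.normal(scale=0.5, size=(200, 8))   # simulated change
new_var = ((drifted - pca.mean_) @ low_var.T).var(axis=0)
print("variance ratio in low-variance PCs:", new_var / ref_var)
```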

Posted Content
TL;DR: This paper leverages randomness to design scalable new variants of nonlinear PCA and CCA; the ideas extend to key multivariate analysis tools such as spectral clustering or LDA.
Abstract: Classical methods such as Principal Component Analysis (PCA) and Canonical Correlation Analysis (CCA) are ubiquitous in statistics. However, these techniques are only able to reveal linear relationships in data. Although nonlinear variants of PCA and CCA have been proposed, these are computationally prohibitive in the large scale. In a separate strand of recent research, randomized methods have been proposed to construct features that help reveal nonlinear patterns in data. For basic tasks such as regression or classification, random features exhibit little or no loss in performance, while achieving drastic savings in computational requirements. In this paper we leverage randomness to design scalable new variants of nonlinear PCA and CCA; our ideas extend to key multivariate analysis tools such as spectral clustering or LDA. We demonstrate our algorithms through experiments on real-world data, on which we compare against the state-of-the-art. A simple R implementation of the presented algorithms is provided.
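One common way to realize this randomized nonlinear PCA idea is to map the data through random Fourier features approximating an RBF kernel and then run ordinary linear PCA in that feature space; the sketch below follows that recipe with arbitrary parameter choices and synthetic ring-shaped data, and is not the authors' implementation:

```python
# Scalable "nonlinear PCA" sketch: random Fourier features + linear PCA.
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
t = rng.uniform(0, 2 * np.pi, size=500)
X = np.column_stack([np.cos(t), np.sin(t)]) + rng.normal(scale=0.05, size=(500, 2))

features = RBFSampler(gamma=1.0, n_components=200, random_state=0).fit_transform(X)
scores = PCA(n_components=2).fit_transform(features)   # approximate kernel PCA scores
print(scores.shape)
```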

Journal ArticleDOI
TL;DR: In this article, the theoretical basis of principal component analysis (PCA) was reviewed, the behavior of PCA when testing for association between a SNP and correlated traits was described, and the power of various PCA-based strategies was compared when analyzing up to 100 correlated traits.
Abstract: Many human traits are highly correlated. This correlation can be leveraged to improve the power of genetic association tests to identify markers associated with one or more of the traits. Principal component analysis (PCA) is a useful tool that has been widely used for the multivariate analysis of correlated variables. PCA is usually applied as a dimension reduction method: the few top principal components (PCs) explaining most of total trait variance are tested for association with a predictor of interest, and the remaining components are not analyzed. In this study we review the theoretical basis of PCA and describe the behavior of PCA when testing for association between a SNP and correlated traits. We then use simulation to compare the power of various PCA-based strategies when analyzing up to 100 correlated traits. We show that contrary to widespread practice, testing only the top PCs often has low power, whereas combining signal across all PCs can have greater power. This power gain is primarily due to increased power to detect genetic variants with opposite effects on positively correlated traits and variants that are exclusively associated with a single trait. Relative to other methods, the combined-PC approach has close to optimal power in all scenarios considered while offering more flexibility and more robustness to potential confounders. Finally, we apply the proposed PCA strategy to the genome-wide association study of five correlated coagulation traits where we identify two candidate SNPs that were not found by the standard approach.

Journal ArticleDOI
TL;DR: In this paper, the effects of the kernel function and the different choices of its parameters on feature extraction and classification are discussed, and the dimension-reduction performance of principal component analysis and kernel principal component analysis is compared.

Journal Article
TL;DR: The minimizer and its subspace are interpreted as robust versions of the empirical inverse covariance and the PCA subspace respectively and compared with many other algorithms for robust PCA on synthetic and real data sets and demonstrate state-of-the-art speed and accuracy.
Abstract: We study the basic problem of robust subspace recovery. That is, we assume a data set in which some of the points are sampled around a fixed subspace and the rest are spread through the whole ambient space, and we aim to recover the fixed underlying subspace. We first estimate a "robust inverse sample covariance" by solving a convex minimization procedure; we then recover the subspace by the bottom eigenvectors of this matrix (their number corresponds to the number of eigenvalues close to 0). We guarantee exact subspace recovery under some conditions on the underlying data. Furthermore, we propose a fast iterative algorithm, which linearly converges to the matrix minimizing the convex problem. We also quantify the effect of noise and regularization and discuss many other practical and theoretical issues for improving the subspace recovery in various settings. When replacing the sum of terms in the convex energy function (that we minimize) with the sum of squares of terms, we obtain that the new minimizer is a scaled version of the inverse sample covariance (when it exists). We thus interpret our minimizer and its subspace (spanned by its bottom eigenvectors) as robust versions of the empirical inverse covariance and the PCA subspace, respectively. We compare our method with many other algorithms for robust PCA on synthetic and real data sets and demonstrate state-of-the-art speed and accuracy.

Journal ArticleDOI
TL;DR: In this article, a combination of principal component analysis (PCA) and artificial neural networks (ANN) was developed to determine its predictive ability for the air pollutant index (API).
Abstract: This study focused on the pattern recognition of Malaysian air quality based on the data obtained from the Malaysian Department of Environment (DOE). Eight air quality parameters in ten monitoring stations in Malaysia for 7 years (2005–2011) were gathered. Principal component analysis (PCA) in the environmetric approach was used to identify the sources of pollution in the study locations. The combination of PCA and artificial neural networks (ANN) was developed to determine its predictive ability for the air pollutant index (API). The PCA has identified that CH4, NmHC, THC, O3, and PM10 are the most significant parameters. The PCA-ANN showed better predictive ability in the determination of API with fewer variables, with R² and root mean square error (RMSE) values of 0.618 and 10.017, respectively. The work has demonstrated the importance of historical data in sampling plan strategies to achieve desired research objectives, as well as to highlight the possibility of determining the optimum number of sampling parameters, which in turn will reduce costs and time of sampling.
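A generic PCA-then-neural-network pipeline in the spirit of PCA-ANN can be sketched as below; the predictors and response are synthetic stand-ins, not the DOE monitoring data, and the layer sizes are arbitrary:

```python
# PCA for reducing correlated predictors, followed by a small MLP regressor.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
X = rng.normal(size=(500, 8))                 # 8 pollutant/meteorological variables (assumed)
api = X[:, 0] * 3 + X[:, 2] - X[:, 5] + rng.normal(scale=0.5, size=500)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                      # keep the leading components
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(X, api)
print("training R^2:", model.score(X, api))
```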

Journal ArticleDOI
TL;DR: A fault-relevant principal component analysis (FPCA) algorithm is proposed for statistical modeling and process monitoring by using both normal and fault data and provides a detailed insight into the decomposition of the original normal process information from the fault-relevant perspective.

Journal ArticleDOI
TL;DR: This paper predicts the bandgaps of over 200 new chalcopyrite compounds with previously untested chemistries, based on robust quantitative structure–activity relationship type models built from the descriptors most related to bandgap.

Journal ArticleDOI
TL;DR: By finding the best low-rank approximation of the data with respect to a transposable quadratic norm, the generalized least-square matrix decomposition (GMD) directly accounts for structural relationships and is demonstrated for dimension reduction, signal recovery, and feature selection with high-dimensional structured data.
Abstract: Variables in many big-data settings are structured, arising, for example, from measurements on a regular grid as in imaging and time series or from spatial-temporal measurements as in climate studies. Classical multivariate techniques ignore these structural relationships, often resulting in poor performance. We propose a generalization of principal components analysis (PCA) that is appropriate for massive datasets with structured variables or known two-way dependencies. By finding the best low-rank approximation of the data with respect to a transposable quadratic norm, our decomposition, entitled the generalized least-square matrix decomposition (GMD), directly accounts for structural relationships. As many variables in high-dimensional settings are often irrelevant, we also regularize our matrix decomposition by adding two-way penalties to encourage sparsity or smoothness. We develop fast computational algorithms using our methods to perform generalized PCA (GPCA), sparse GPCA, and functional GPCA on massive datasets.

Posted Content
TL;DR: New algorithms and analyses for distributed PCA are given which lead to improved communication and computational costs for k-means clustering and related problems, and a speedup of orders of magnitude is shown on real world data.
Abstract: We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve $\ell_2$-error fitting problems such as $k$-means clustering and subspace clustering. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of these techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.
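A much simplified illustration of the distributed setting, not the paper's communication-optimal protocol, has each "server" send a small SVD summary of its local points and the coordinator compute a global subspace from the stacked summaries:

```python
# Distributed PCA sketch: local SVD summaries, stacked at a coordinator.
import numpy as np

rng = np.random.default_rng(13)
servers = [rng.normal(size=(400, 30)) @ rng.normal(size=(30, 30)) for _ in range(4)]
t = 10                                          # size of each local summary (arbitrary)

summaries = []
for local in servers:
    U, s, Vt = np.linalg.svd(local, full_matrices=False)
    summaries.append(np.diag(s[:t]) @ Vt[:t])   # t x 30 sketch of this server's data

stacked = np.vstack(summaries)                  # coordinator input: (4*t) x 30
_, _, Vt_global = np.linalg.svd(stacked, full_matrices=False)
global_subspace = Vt_global[:5]                 # approximate top-5 principal directions
print(global_subspace.shape)
```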

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a dual data-driven PCA/SIMCA (DD-SIMCA) approach to construct a two-level decision area with extreme and outlier thresholds, both in the case of a regular data set and in the presence of outliers.
Abstract: For the construction of a reliable decision area in the soft independent modeling by class analogy (SIMCA) method, it is necessary to analyze calibration data revealing the objects of special types such as extremes and outliers. For this purpose, a thorough statistical analysis of the scores and orthogonal distances is necessary. The distance values should be considered as any data acquired in the experiment, and their distributions are estimated by a data-driven method, such as a method of moments or similar. The scaled chi-squared distribution seems to be the first candidate among the others in such an assessment. This provides the possibility of constructing a two-level decision area, with the extreme and outlier thresholds, both in case of regular data set and in the presence of outliers. We suggest the application of classical principal component analysis (PCA) with further use of enhanced robust estimators both for the scaling factor and for the number of degrees of freedom. A special diagnostic tool called extreme plot is proposed for the analyses of calibration objects. Extreme objects play an important role in data analysis. These objects are a mandatory attribute of any data set. The advocated dual data-driven PCA/SIMCA (DD-SIMCA) approach has demonstrated a proper performance in the analysis of simulated and real-world data for both regular and contaminated cases. DD-SIMCA has also been compared with robust principal component analysis, which is a fully robust method. Copyright © 2013 John Wiley & Sons, Ltd.
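The two distances at the heart of this kind of analysis, the score distance and the orthogonal distance from a PCA model, can be sketched as below. The chi-squared cutoffs here use simple plug-in scaling factors and degrees of freedom, not the data-driven (method-of-moments) estimators advocated for DD-SIMCA:

```python
# Score distance (SD) and orthogonal distance (OD) from a PCA model,
# with rough scaled chi-squared cutoffs for flagging extreme objects.
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import PCA

rng = np.random.default_rng(14)
X = rng.normal(size=(100, 10)) @ rng.normal(size=(10, 10))
pca = PCA(n_components=3).fit(X)

scores = pca.transform(X)
sd = np.sum(scores**2 / pca.explained_variance_, axis=1)   # score distance
residuals = X - pca.inverse_transform(scores)
od = np.sum(residuals**2, axis=1)                          # orthogonal distance

alpha = 0.05
sd_cut = chi2.ppf(1 - alpha, df=3) * sd.mean() / 3         # plug-in scaled chi-squared cutoff
od_cut = chi2.ppf(1 - alpha, df=2) * od.mean() / 2         # df = 2 is an arbitrary choice here
extremes = np.where((sd > sd_cut) | (od > od_cut))[0]
print("flagged extreme objects:", extremes)
```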

Journal ArticleDOI
TL;DR: The O-PCA method is shown to perform very well for history matching problems, and to provide models that capture the key sand–sand and sand–shale connectivities evident in the true model.
Abstract: A new approach based on principal component analysis (PCA) for the representation of complex geological models in terms of a small number of parameters is presented. The basis matrix required by the method is constructed from a set of prior geological realizations generated using a geostatistical algorithm. Unlike standard PCA-based methods, in which the high-dimensional model is constructed from a (small) set of parameters by simply performing a multiplication using the basis matrix, in this method the mapping is formulated as an optimization problem. This enables the inclusion of bound constraints and regularization, which are shown to be useful for capturing highly connected geological features and binary/bimodal (rather than Gaussian) property distributions. The approach, referred to as optimization-based PCA (O-PCA), is applied here mainly for binary-facies systems, in which case the requisite optimization problem is separable and convex. The analytical solution of the optimization problem, as well as the derivative of the model with respect to the parameters, is obtained analytically. It is shown that the O-PCA mapping can also be viewed as a post-processing of the standard PCA model. The O-PCA procedure is applied both to generate new (random) realizations and for gradient-based history matching. For the latter, two- and three-dimensional systems, involving channelized and deltaic-fan geological models, are considered. The O-PCA method is shown to perform very well for these history matching problems, and to provide models that capture the key sand–sand and sand–shale connectivities evident in the true model. Finally, the approach is extended to generate bimodal systems in which the properties of both facies are characterized by Gaussian distributions. MATLAB code with the O-PCA implementation, and examples demonstrating its use are provided online as Supplementary Materials.