
Showing papers on "Principal component analysis published in 2009"


Posted Content
TL;DR: A new online optimization algorithm is proposed, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems.
Abstract: Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.
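As a rough illustration of the kind of online matrix factorization described above, the sketch below runs scikit-learn's MiniBatchDictionaryLearning (whose online updates belong to this family of stochastic-approximation methods) on random stand-in patches. The patch data, dictionary size and penalty are illustrative assumptions, not the paper's experimental setup, and the parameter names assume a recent scikit-learn release.

```python
# Minimal sketch: online dictionary learning / sparse coding on stand-in "image patches".
# All sizes and penalties are illustrative; this is not the paper's experimental setup.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 64))        # stand-in for 8x8 image patches, one per row
X -= X.mean(axis=1, keepdims=True)           # remove the mean of each patch

dico = MiniBatchDictionaryLearning(
    n_components=100,    # number of dictionary atoms (the learned basis)
    alpha=1.0,           # l1 penalty controlling sparsity of the codes
    batch_size=256,      # samples per online (mini-batch) update
    max_iter=10,         # passes over the data (recent scikit-learn API)
    random_state=0,
)
codes = dico.fit_transform(X)                # sparse codes, one row per patch
D = dico.components_                         # learned dictionary (100 x 64)
print(codes.shape, D.shape, "fraction of nonzero code entries:", (codes != 0).mean())
```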

2,256 citations


Proceedings Article
07 Dec 2009
TL;DR: It is proved that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which it is given a fast and provably convergent algorithm.
Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized "robust principal component analysis" problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision.
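A minimal numpy sketch of the convex "low-rank plus sparse" recovery described above, using a standard augmented-Lagrangian iteration (singular-value thresholding for A, soft-thresholding for E). It is not the authors' fast algorithm; the weight lam and step mu below follow common heuristics from the robust PCA literature.

```python
import numpy as np

def robust_pca(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Recover low-rank A and sparse E with D = A + E via an ALM-style iteration (sketch)."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))                 # standard weight for the l1 term
    mu = mu or 0.25 * m * n / (np.abs(D).sum() + 1e-12)   # common heuristic step size
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)                                  # Lagrange multiplier
    norm_D = np.linalg.norm(D)
    for _ in range(max_iter):
        # low-rank update: singular-value thresholding
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # sparse update: elementwise soft-thresholding
        R = D - A + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual update and convergence check on the equality constraint
        Y += mu * (D - A - E)
        if np.linalg.norm(D - A - E) <= tol * norm_D:
            break
    return A, E

# toy example: a low-rank matrix with 5% grossly corrupted entries
rng = np.random.default_rng(0)
A0 = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 200))
E0 = np.zeros((200, 200))
idx = rng.random((200, 200)) < 0.05
E0[idx] = 10 * rng.standard_normal(idx.sum())
A_hat, E_hat = robust_pca(A0 + E0)
print("relative recovery error:", np.linalg.norm(A_hat - A0) / np.linalg.norm(A0))
```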

1,479 citations


Book
17 Feb 2009
TL;DR: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables
Abstract: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables Summary Principal Component Analysis (PCA) Concepts Number of PCA Components Centering and Scaling Outliers and Data Distribution Robust PCA Algorithms for PCA Evaluation and Diagnostics Complementary Methods for Exploratory Data Analysis Examples Summary Calibration Concepts Performance of Regression Models Ordinary Least Squares Regression Robust Regression Variable Selection Principal Component Regression Partial Least Squares Regression Related Methods Examples Summary Classification Concepts Linear Classification Methods Kernel and Prototype Methods Classification Trees Artificial Neural Networks Support Vector Machine Evaluation Examples Summary Cluster Analysis Concepts Distance and Similarity Measures Partitioning Methods Hierarchical Clustering Methods Fuzzy Clustering Model-Based Clustering Cluster Validity and Clustering Tendency Measures Examples Summary Preprocessing Concepts Smoothing and Differentiation Multiplicative Signal Correction Mass Spectral Features Appendix 1: Symbols and Abbreviations Appendix 2: Matrix Algebra Appendix 3: Introduction to R Index References appear at the end of each chapter

1,003 citations


Journal ArticleDOI
TL;DR: A simple algorithm for selecting a subset of coordinates with largest sample variances is provided, and it is shown that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.
Abstract: Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n → 0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.
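The sketch below illustrates the two-step recipe in this abstract with numpy: keep the coordinates with the largest sample variances, run PCA on that subset, and embed the leading eigenvector back into the full space. The subset size k and the simulated spiked data are illustrative choices, not the paper's selection rule or asymptotic setting.

```python
import numpy as np

def subset_pca_leading_vector(X, k):
    """Select the k coordinates with largest sample variance, do PCA on them,
    and return the leading eigenvector embedded back into all p coordinates."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    keep = np.argsort(Xc.var(axis=0))[-k:]          # coordinates with largest variance
    cov_sub = np.cov(Xc[:, keep], rowvar=False)
    _, V = np.linalg.eigh(cov_sub)
    v = np.zeros(p)
    v[keep] = V[:, -1]                              # leading eigenvector on the subset
    return v

# spiked model with p >> n: the signal lives on a few coordinates
rng = np.random.default_rng(0)
n, p = 100, 2000
u = np.zeros(p); u[:20] = 1 / np.sqrt(20)           # sparse population eigenvector
X = rng.standard_normal((n, 1)) * 3 @ u[None, :] + rng.standard_normal((n, p))
v_hat = subset_pca_leading_vector(X, k=50)
print("overlap with the true direction:", abs(v_hat @ u))
```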

937 citations


Journal ArticleDOI
TL;DR: A novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering is proposed; experimental results confirm the effectiveness of the approach.
Abstract: In this letter, we propose a novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering. The difference image is partitioned into h × h nonoverlapping blocks. S (S ≤ h²) orthonormal eigenvectors are extracted through PCA of the h × h nonoverlapping block set to create an eigenvector space. Each pixel in the difference image is represented by an S-dimensional feature vector, which is the projection of the h × h difference image data onto the generated eigenvector space. Change detection is achieved by partitioning the feature vector space into two clusters using k-means clustering with k = 2 and then assigning each pixel to one of the two clusters using the minimum Euclidean distance between the pixel's feature vector and the mean feature vector of each cluster. Experimental results confirm the effectiveness of the proposed approach.
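A compact numpy/scikit-learn sketch of the pipeline just described: PCA of non-overlapping h × h blocks of the difference image, projection of each pixel's h × h neighbourhood onto the leading S eigenvectors, and k-means with k = 2 on the resulting feature vectors. The block size, S and the synthetic images are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.cluster import KMeans

def pca_kmeans_change_map(img1, img2, h=4, S=3):
    """Unsupervised change map from two co-registered images (sketch of the method above)."""
    diff = np.abs(img1.astype(float) - img2.astype(float))
    H, W = diff.shape
    Hc, Wc = H - H % h, W - W % h                          # crop so blocks tile exactly
    blocks = diff[:Hc, :Wc].reshape(Hc // h, h, Wc // h, h)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, h * h)   # non-overlapping h x h blocks
    mean = blocks.mean(axis=0)
    _, _, Vt = np.linalg.svd(blocks - mean, full_matrices=False)
    eigvecs = Vt[:S]                                        # leading S orthonormal eigenvectors
    # per-pixel h x h neighbourhoods (reflect-padded), projected onto the eigenvector space
    pad = h // 2
    padded = np.pad(diff, pad, mode="reflect")
    windows = sliding_window_view(padded, (h, h))[:H, :W].reshape(H * W, h * h)
    features = (windows - mean) @ eigvecs.T
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    change_map = labels.reshape(H, W)
    # call the cluster with the larger mean difference "changed"
    if diff[change_map == 0].mean() > diff[change_map == 1].mean():
        change_map = 1 - change_map
    return change_map

rng = np.random.default_rng(0)
img1 = rng.random((64, 64))
img2 = img1.copy(); img2[20:40, 20:40] += 0.8               # synthetic change region
print(pca_kmeans_change_map(img1, img2).sum(), "pixels flagged as changed")
```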

817 citations


Journal ArticleDOI
TL;DR: An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.
Abstract: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high “statistical leverage” and, thus, in a very precise statistical sense, exert a disproportionately large “influence” on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
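The sketch below shows one common leverage-score CUR construction in the spirit of this article: leverage scores computed from the top-k singular subspace drive a random choice of actual columns and rows, and the middle factor is fit by least squares. The sampling scheme and sizes are illustrative and do not reproduce the paper's specific algorithm or its error guarantees.

```python
import numpy as np

def leverage_scores(A, k, axis):
    """Normalized statistical leverage scores computed from the top-k singular subspace."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    B = Vt[:k].T if axis == "columns" else U[:, :k]
    scores = (B ** 2).sum(axis=1)
    return scores / scores.sum()

def cur(A, k=5, c=20, r=20, rng=None):
    """Pick c columns and r rows of A with probability proportional to leverage,
    then fit U = C^+ A R^+ so that C @ U @ R approximates A."""
    rng = np.random.default_rng(rng)
    cols = rng.choice(A.shape[1], size=c, replace=False, p=leverage_scores(A, k, "columns"))
    rows = rng.choice(A.shape[0], size=r, replace=False, p=leverage_scores(A, k, "rows"))
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 200))   # roughly rank-8 matrix
C, U, R, cols, rows = cur(A, k=8, c=30, r=30, rng=0)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(f"relative CUR error: {err:.3f}; some columns kept: {sorted(cols)[:5]}")
```

Because C and R are actual columns and rows of the data matrix, they stay interpretable in the units and meaning of the original variables, which is the point the article emphasizes.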

815 citations


Journal ArticleDOI
TL;DR: The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but – on an optimal subset of features – the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only.
Abstract: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees which are based on orthogonal splits in feature space. We propose to combine the best of both approaches, and evaluate the joint use of a feature selection based on a recursive feature elimination using the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, a feature selection using the Gini feature importance with a regularized classification by discriminant partial least squares regression performed as well as or better than a filtering according to different univariate statistical tests, or using regression coefficients in a backward feature elimination. It outperformed the direct application of the random forest classifier, or the direct application of the regularized classifiers on the full set of features. The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but – on an optimal subset of features – the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn a double benefit of both dimensionality reduction and the elimination of noise from the classification task.
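A hedged scikit-learn sketch of the combination evaluated above: recursive feature elimination ranked by random-forest Gini importance, followed by discriminant PLS (PLS-DA) on the surviving features. The data are a synthetic stand-in for spectra, and the tree count, elimination step and number of PLS components are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

# synthetic "spectra": many correlated features, few of them informative
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           n_redundant=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# recursive feature elimination driven by random-forest Gini importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFE(rf, n_features_to_select=30, step=50).fit(X_tr, y_tr)

# PLS-DA: regress the 0/1 class label on the selected features, threshold predictions at 0.5
pls = PLSRegression(n_components=5).fit(X_tr[:, selector.support_], y_tr)
y_pred = (pls.predict(X_te[:, selector.support_]).ravel() > 0.5).astype(int)
print("accuracy with Gini-RFE + PLS-DA:", (y_pred == y_te).mean())
```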

726 citations


Journal ArticleDOI
TL;DR: In this paper, the effects of discreteness of the observed variables on principal component analysis (PCA) are reviewed and the statistical properties of the popular Filmer and Pritchett (2001) procedure are analyzed.
Abstract: The last several years have seen a growth in the number of publications in economics that use principal component analysis (PCA) in the area of welfare studies. This paper explores the ways discrete data can be incorporated into PCA. The effects of discreteness of the observed variables on the PCA are reviewed. The statistical properties of the popular Filmer and Pritchett (2001) procedure are analyzed. The concepts of polychoric and polyserial correlations are introduced with appropriate references to the existing literature demonstrating their statistical properties. A large simulation study is carried out to compare various implementations of discrete data PCA. The simulation results show that the currently used method of running PCA on a set of dummy variables as proposed by Filmer and Pritchett (2001) can be improved upon by using procedures appropriate for discrete data, such as retaining the ordinal variables without breaking them into a set of dummy variables or using polychoric correlations. An empirical example using Bangladesh 2000 Demographic and Health Survey data helps in explaining the differences between procedures.
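The sketch below contrasts two of the options compared in the paper on simulated ordinal asset indicators: PCA of a set of dummy variables (the Filmer and Pritchett style index) versus PCA of the variables kept as ordered codes. A full polychoric-correlation PCA needs a dedicated routine and is not shown; the simulated data and category cut-points are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# simulate ordinal asset indicators driven by one latent "wealth" factor
rng = np.random.default_rng(0)
n, p = 1000, 8
wealth = rng.standard_normal(n)
latent = wealth[:, None] + 0.8 * rng.standard_normal((n, p))
ordinal = np.digitize(latent, [-0.5, 0.5, 1.5])            # 4 ordered categories per variable

# option 1: break each variable into dummies, take the first PC (Filmer-Pritchett style)
dummies = (ordinal[:, :, None] == np.arange(4)).reshape(n, -1).astype(float)
index_fp = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(dummies)).ravel()

# option 2: keep the ordinal coding, take the first PC of the standardized variables
index_ord = PCA(n_components=1).fit_transform(
    StandardScaler().fit_transform(ordinal.astype(float))).ravel()

for name, idx in [("dummy-variable PCA", index_fp), ("ordinal PCA", index_ord)]:
    corr = abs(np.corrcoef(idx, wealth)[0, 1])
    print(f"{name}: |correlation with latent wealth| = {corr:.3f}")
```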

712 citations


Journal ArticleDOI
TL;DR: It is shown that ignoring phylogeny in preliminary transformations can result in significantly elevated variance and type I error in the authors' statistical estimators, even if subsequent analysis of the transformed data is performed using phylogenetic methods.
Abstract: Phylogenetic methods for the analysis of species data are widely used in evolutionary studies. However, preliminary data transformations and data reduction procedures (such as a size-correction and principal components analysis, PCA) are often performed without first correcting for nonindependence among the observations for species. In the present short comment and attached R and MATLAB code, I provide an overview of statistically correct procedures for phylogenetic size-correction and PCA. I also show that ignoring phylogeny in preliminary transformations can result in significantly elevated variance and type I error in our statistical estimators, even if subsequent analysis of the transformed data is performed using phylogenetic methods. This means that ignoring phylogeny during preliminary data transformations can possibly lead to spurious results in phylogenetic statistical analyses of species data.

698 citations


Journal ArticleDOI
01 May 2009 - Oikos
TL;DR: Partial least squares regression (PLSR), as presented in this paper, is a statistical technique particularly well suited to analyzing a large array of related (i.e. not truly independent) predictor variables, to sample sizes that are small relative to the number of predictors, and to cases in which complex phenomena or syndromes must be defined as a combination of several independently obtained variables.
Abstract: This paper briefly presents the aims, requirements and results of partial least squares regression analysis (PLSR), and its potential utility in ecological studies. This statistical technique is particularly well suited to analyzing a large array of related predictor variables (i.e. not truly independent), with a sample size not large enough compared to the number of independent variables, and in cases in which an attempt is made to approach complex phenomena or syndromes that must be defined as a combination of several variables obtained independently. A simulation experiment is carried out to compare this technique with multiple regression (MR) and with a combination of principal component analysis and multiple regression (PCA+MR), varying the number of predictor variables and sample sizes. PLSR models explained a similar amount of variance to those results obtained by MR and PCA+MR. However, PLSR was more reliable than other techniques when identifying relevant variables and their magnitudes of influence, especially in cases of small sample size and low tolerance. Finally, we present one example of PLSR to illustrate its application and interpretation in ecology.
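A brief scikit-learn sketch of the three approaches compared in the simulation experiment, multiple regression (MR), PCA followed by regression (PCA+MR) and PLSR, fit to the same small sample of collinear predictors. The sample size, number of predictors and noise level are arbitrary illustrations, not the paper's simulation design.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# small sample, many correlated predictors, response driven by a few latent factors
rng = np.random.default_rng(1)
n, p = 40, 25
latent = rng.standard_normal((n, 3))
X = latent @ rng.standard_normal((3, p)) + 0.3 * rng.standard_normal((n, p))
y = latent[:, 0] - 2 * latent[:, 1] + 0.5 * rng.standard_normal(n)

models = {
    "MR (multiple regression)": LinearRegression(),
    "PCA + MR": make_pipeline(PCA(n_components=3), LinearRegression()),
    "PLSR": PLSRegression(n_components=3),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```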

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper describes a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set, and is shown to outperform state-of-the-art techniques on three varied datasets.
Abstract: Significant research has been devoted to detecting people in images and videos. In this paper we describe a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set. This augmentation results in an extremely high-dimensional feature space (more than 170,000 dimensions). In such high-dimensional spaces, classical machine learning algorithms such as SVMs are nearly intractable with respect to training. Furthermore, the number of training samples is much smaller than the dimensionality of the feature space, by at least an order of magnitude. Finally, the extraction of features from a densely sampled grid structure leads to a high degree of multicollinearity. To circumvent these data characteristics, we employ Partial Least Squares (PLS) analysis, an efficient dimensionality reduction technique, one which preserves significant discriminative information, to project the data onto a much lower dimensional subspace (20 dimensions, reduced from the original 170,000). Our human detection system, employing PLS analysis over the enriched descriptor set, is shown to outperform state-of-the-art techniques on three varied datasets including the popular INRIA pedestrian dataset, the low-resolution gray-scale DaimlerChrysler pedestrian dataset, and the ETHZ pedestrian dataset consisting of full-length videos of crowded scenes.

Journal ArticleDOI
Gil McVean1
TL;DR: For SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes, which provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture.
Abstract: Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's F_ST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

06 Apr 2009
TL;DR: This work proposes an image denoising method that exploits nonlocal image modeling, principal component analysis (PCA), and local shape-adaptive anisotropic estimation and shows that the proposed method is competitive and outperforms some of the current best denoising methods, especially in preserving image details and introducing very few artifacts.
Abstract: We propose an image denoising method that exploits nonlocal image modeling, principal component analysis (PCA), and local shape-adaptive anisotropic estimation. The nonlocal modeling is exploited by grouping similar image patches in 3-D groups. The denoising is performed by shrinkage of the spectrum of a 3-D transform applied on such groups. The effectiveness of the shrinkage depends on the ability of the transform to sparsely represent the true-image data, thus separating it from the noise. We propose to improve the sparsity in two aspects. First, we employ image patches (neighborhoods) which can have data-adaptive shape. Second, we propose PCA on these adaptive-shape neighborhoods as part of the employed 3-D transform. The PCA bases are obtained by eigenvalue decomposition of empirical second-moment matrices that are estimated from groups of similar adaptive-shape neighborhoods. We show that the proposed method is competitive and outperforms some of the current best denoising methods, especially in preserving image details and introducing very few artifacts.

Journal ArticleDOI
TL;DR: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns that can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and achieves higher accuracy over larger datasets.
Abstract: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns. The proposed feature extraction process utilizes morphological wavelet transform features, which are projected onto a lower dimensional feature space using principal component analysis, and temporal features from the ECG data. For the pattern recognition unit, feedforward and fully connected artificial neural networks, which are optimally designed for each patient by the proposed multidimensional particle swarm optimization technique, are employed. By using relatively small common and patient-specific training data, the proposed classification system can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and thus, achieves higher accuracy over larger datasets. The classification experiments over a benchmark database demonstrate that the proposed system achieves such average accuracies and sensitivities better than most of the current state-of-the-art algorithms for detection of ventricular ectopic beats (VEBs) and supra-VEBs (SVEBs). Over the entire database, the average accuracy-sensitivity performances of the proposed system for VEB and SVEB detections are 98.3%-84.6% and 97.4%-63.5%, respectively. Finally, due to its parameter-invariant nature, the proposed system is highly generic, and thus, applicable to any ECG dataset.

Book
28 Sep 2009
TL;DR: This book presents chemometric methods for pattern recognition, covering exploratory data analysis, preprocessing, two-class, one-class and multiclass classifiers, validation and optimization, and the determination of potential discriminatory variables, illustrated with case studies ranging from forensic analysis of banknotes to metabolic profiling of mouse urine.
Abstract: Acknowledgements. Preface. 1 Introduction. 1.1 Past, Present and Future. 1.2 About this Book. Bibliography. 2 Case Studies. 2.1 Introduction. 2.2 Datasets, Matrices and Vectors. 2.3 Case Study 1: Forensic Analysis of Banknotes. 2.4 Case Study 2: Near Infrared Spectroscopic Analysis of Food. 2.5 Case Study 3: Thermal Analysis of Polymers. 2.6 Case Study 4: Environmental Pollution using Headspace Mass Spectrometry. 2.7 Case Study 5: Human Sweat Analysed by Gas Chromatography Mass Spectrometry. 2.8 Case Study 6: Liquid Chromatography Mass Spectrometry of Pharmaceutical Tablets. 2.9 Case Study 7: Atomic Spectroscopy for the Study of Hypertension. 2.10 Case Study 8: Metabolic Profiling of Mouse Urine by Gas Chromatography of Urine Extracts. 2.11 Case Study 9: Nuclear Magnetic Resonance Spectroscopy for Salival Analysis of the Effect of Mouthwash. 2.12 Case Study 10: Simulations. 2.13 Case Study 11: Null Dataset. 2.14 Case Study 12: GCMS and Microbiology of Mouse Scent Marks. Bibliography. 3 Exploratory Data Analysis. 3.1 Introduction. 3.2 Principal Components Analysis. 3.2.1 Background. 3.2.2 Scores and Loadings. 3.2.3 Eigenvalues. 3.2.4 PCA Algorithm. 3.2.5 Graphical Representation. 3.3 Dissimilarity Indices, Principal Co-ordinates Analysis and Ranking. 3.3.1 Dissimilarity. 3.3.2 Principal Co-ordinates Analysis. 3.3.3 Ranking. 3.4 Self Organizing Maps. 3.4.1 Background. 3.4.2 SOM Algorithm. 3.4.3 Initialization. 3.4.4 Training. 3.4.5 Map Quality. 3.4.6 Visualization. Bibliography. 4 Preprocessing. 4.1 Introduction. 4.2 Data Scaling. 4.2.1 Transforming Individual Elements. 4.2.2 Row Scaling. 4.2.3 Column Scaling. 4.3 Multivariate Methods of Data Reduction. 4.3.1 Largest Principal Components. 4.3.2 Discriminatory Principal Components. 4.3.3 Partial Least Squares Discriminatory Analysis Scores. 4.4 Strategies for Data Preprocessing. 4.4.1 Flow Charts. 4.4.2 Level 1. 4.4.3 Level 2. 4.4.4 Level 3. 4.4.5 Level 4. Bibliography. 5 Two Class Classifiers. 5.1 Introduction. 5.1.1 Two Class Classifiers. 5.1.2 Preprocessing. 5.1.3 Notation. 5.1.4 Autoprediction and Class Boundaries. 5.2 Euclidean Distance to Centroids. 5.3 Linear Discriminant Analysis. 5.4 Quadratic Discriminant Analysis. 5.5 Partial Least Squares Discriminant Analysis. 5.5.1 PLS Method. 5.5.2 PLS Algorithm. 5.5.3 PLS-DA. 5.6 Learning Vector Quantization. 5.6.1 Voronoi Tesselation and Codebooks. 5.6.2 LVQ1. 5.6.3 LVQ3. 5.6.4 LVQ Illustration and Summary of Parameters. 5.7 Support Vector Machines. 5.7.1 Linear Learning Machines. 5.7.2 Kernels. 5.7.3 Controlling Complexity and Soft Margin SVMs. 5.7.4 SVM Parameters. Bibliography. 6 One Class Classifiers. 6.1 Introduction. 6.2 Distance Based Classifiers. 6.3 PC Based Models and SIMCA. 6.4 Indicators of Significance. 6.4.1 Gaussian Density Estimators and Chi-Squared. 6.4.2 Hotelling's T 2 . 6.4.3 D-Statistic. 6.4.4 Q-Statistic or Squared Prediction Error. 6.4.5 Visualization of D- and Q-Statistics for Disjoint PC Models. 6.4.6 Multivariate Normality and What to do if it Fails. 6.5 Support Vector Data Description. 6.6 Summarizing One Class Classifiers. 6.6.1 Class Membership Plots. 6.6.2 ROC Curves. Bibliography. 7 Multiclass Classifiers. 7.1 Introduction. 7.2 EDC, LDA and QDA. 7.3 LVQ. 7.4 PLS. 7.4.1 PLS2. 7.4.2 PLS1. 7.5 SVM. 7.6 One against One Decisions. Bibliography. 8 Validation and Optimization. 8.1 Introduction. 8.1.1 Validation. 8.1.2 Optimization. 8.2 Classification Abilities, Contingency Tables and Related Concepts. 8.2.1 Two Class Classifiers. 8.2.2 Multiclass Classifiers. 
8.2.3 One Class Classifiers. 8.3 Validation. 8.3.1 Testing Models. 8.3.2 Test and Training Sets. 8.3.3 Predictions. 8.3.4 Increasing the Number of Variables for the Classifier. 8.4 Iterative Approaches for Validation. 8.4.1 Predictive Ability, Model Stability, Classification by Majority Vote and Cross Classification Rate. 8.4.2 Number of Iterations. 8.4.3 Test and Training Set Boundaries. 8.5 Optimizing PLS Models. 8.5.1 Number of Components: Cross-Validation and Bootstrap. 8.5.2 Thresholds and ROC Curves. 8.6 Optimizing Learning Vector Quantization Models. 8.7 Optimizing Support Vector Machine Models. Bibliography. 9 Determining Potential Discriminatory Variables. 9.1 Introduction. 9.1.1 Two Class Distributions. 9.1.2 Multiclass Distributions. 9.1.3 Multilevel and Multiway Distributions. 9.1.4 Sample Sizes. 9.1.5 Modelling after Variable Reduction. 9.1.6 Preliminary Variable Reduction. 9.2 Which Variables are most Significant?. 9.2.1 Basic Concepts: Statistical Indicators and Rank. 9.2.2 T-Statistic and Fisher Weights. 9.2.3 Multiple Linear Regression, ANOVA and the F-Ratio. 9.2.4 Partial Least Squares. 9.2.5 Relationship between the Indicator Functions. 9.3 How Many Variables are Significant? 9.3.1 Probabilistic Approaches. 9.3.2 Empirical Methods: Monte Carlo. 9.3.3 Cost/Benefit of Increasing the Number of Variables. Bibliography. 10 Bayesian Methods and Unequal Class Sizes. 10.1 Introduction. 10.2 Contingency Tables and Bayes' Theorem. 10.3 Bayesian Extensions to Classifiers. Bibliography. 11 Class Separation Indices. 11.1 Introduction. 11.2 Davies Bouldin Index. 11.3 Silhouette Width and Modified Silhouette Width. 11.3.1 Silhouette Width. 11.3.2 Modified Silhouette Width. 11.4 Overlap Coefficient. Bibliography. 12 Comparing Different Patterns. 12.1 Introduction. 12.2 Correlation Based Methods. 12.2.1 Mantel Test. 12.2.2 R V Coefficient. 12.3 Consensus PCA. 12.4 Procrustes Analysis. Bibliography. Index.

Journal ArticleDOI
TL;DR: It turns out that the procedure using ilr-transformed data and robust PCA delivers superior results to all other approaches, demonstrating that, due to the compositional nature of geochemical data, PCA should not be carried out without an appropriate transformation.
Abstract: Compositional data (almost all data in geochemistry) are closed data, that is they usually sum up to a constant (e.g. weight percent, wt.%) and carry only relative information. Thus, the covariance structure of compositional data is strongly biased and results of many multivariate techniques become doubtful without a proper transformation of the data. The centred logratio transformation (clr) is often used to open closed data. However, the transformed data do not have full rank following a logratio transformation and cannot be used for robust multivariate techniques like principal component analysis (PCA). Here we propose to use the isometric logratio transformation (ilr) instead. However, the ilr transformation has the disadvantage that the resulting new variables are no longer directly interpretable in terms of the originally entered variables. Here we propose a technique by which the scores and loadings resulting from a robust PCA on ilr-transformed data can be back-transformed and interpreted. The procedure is demonstrated using a real data set from regional geochemistry and compared to results from non-transformed and non-robust versions of PCA. It turns out that the procedure using ilr-transformed data and robust PCA delivers superior results to all other approaches. The examples demonstrate that due to the compositional nature of geochemical data PCA should not be carried out without an appropriate transformation. Furthermore a robust approach is preferable if the dataset contains outliers.
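A numpy sketch of the workflow described above, with classical (non-robust) PCA standing in for the robust PCA used in the paper: ilr-transform the closed compositions using an orthonormal basis, run PCA on the full-rank coordinates, and map the loadings back to clr space so they can be read in terms of the original parts. The basis construction is one standard choice, and the compositional data are simulated.

```python
import numpy as np

def ilr_basis(D):
    """Orthonormal D x (D-1) basis of the clr hyperplane (one standard Helmert-type choice)."""
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i
        V[i, i - 1] = -1.0
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(X):
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)     # centred logratio coordinates
    return clr @ ilr_basis(X.shape[1])

# toy compositional data: 5 parts, rows closed to 100 (e.g. wt.%)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(300, 5)) * [5, 1, 3, 0.5, 2]
comp = 100 * raw / raw.sum(axis=1, keepdims=True)

Z = ilr(comp)                                         # full-rank ilr coordinates
Zc = Z - Z.mean(axis=0)
_, s, Wt = np.linalg.svd(Zc, full_matrices=False)     # classical PCA (the paper uses robust PCA)
scores = Zc @ Wt.T                                    # PCA scores in ilr space
clr_loadings = ilr_basis(comp.shape[1]) @ Wt.T        # loadings back-transformed to clr space
print("explained variance ratio:", (s**2 / (s**2).sum()).round(3))
print("clr loadings of PC1:", clr_loadings[:, 0].round(3), "scores shape:", scores.shape)
```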

Journal ArticleDOI
TL;DR: In this article, the authors describe an efficient algorithm for low-rank approximation of matrices that produces accuracy that is very close to the best possible accuracy, for matrices of arbitrary sizes.
Abstract: Principal component analysis (PCA) requires the computation of a low-rank approximation to a matrix containing the data being analyzed. In many applications of PCA, the best possible accuracy of any rank-deficient approximation is at most a few digits (measured in the spectral norm, relative to the spectral norm of the matrix being approximated). In such circumstances, efficient algorithms have not come with guarantees of good accuracy, unless one or both dimensions of the matrix being approximated are small. We describe an efficient algorithm for the low-rank approximation of matrices that produces accuracy that is very close to the best possible accuracy, for matrices of arbitrary sizes. We illustrate our theoretical results via several numerical examples.
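The sketch below implements a generic randomized low-rank approximation (random test matrix, a few power iterations, then a small SVD) in the spirit of the algorithms discussed in this paper. Oversampling and iteration counts are illustrative; the paper's specific scheme and its accuracy analysis are not reproduced here.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, n_power_iter=2, rng=None):
    """Rank-k approximation U @ diag(s) @ Vt of A via a randomized range finder."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    G = rng.standard_normal((n, k + oversample))      # random test matrix
    Y = A @ G
    for _ in range(n_power_iter):                     # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis for (approx.) range of A
    B = Q.T @ A                                       # small (k+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 800)) \
    + 0.01 * rng.standard_normal((1000, 800))
U, s, Vt = randomized_low_rank(A, k=30)
err = np.linalg.norm(A - (U * s) @ Vt, 2) / np.linalg.norm(A, 2)
print(f"relative spectral-norm error of the rank-30 approximation: {err:.2e}")
```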

Journal ArticleDOI
TL;DR: A total projection to latent structures (T-PLS) algorithm is proposed that divides the X-space into four parts instead of the two parts of standard PLS.
Abstract: Partial least squares, or projection to latent structures (PLS), has been used in multivariate statistical process monitoring in a manner similar to principal component analysis. Standard PLS often requires many components or latent variables (LVs), which contain variations orthogonal to Y and useless for predicting Y. Further, the X-residual of PLS usually has quite large variations and thus is not proper to monitor with the Q-statistic. To reduce false alarm and missing alarm rates for faults related to Y, a total projection to latent structures (T-PLS) algorithm is proposed in this article. The new structure divides the X-space into four parts instead of the two parts of standard PLS. The properties of T-PLS are studied in detail, including its relationship to orthogonal PLS. Further study shows the space decomposition on the X-space induced by T-PLS. A fault detection policy is developed based on the T-PLS. Case studies on two simulation examples show the effectiveness of the T-PLS based fault detection methods.

Journal ArticleDOI
TL;DR: Fifteen physicochemical descriptors of side chains of the 20 natural and of 26 non-coded amino acids are compiled and simple methods for their evaluation described and it is shown that three principal components account for 66% of the total variance in the available set.
Abstract: Fifteen physicochemical descriptors of side chains of the 20 natural and of 26 non-coded amino acids are compiled and simple methods for their evaluation described. The relevance of these parameters to account for hydrophobic, steric, and electric properties of the side chains is assessed and their intercorrelation analyzed. It is shown that three principal components, one steric, one bulk, and one electric (electronic), account for 66% of the total variance in the available set. These parameters may prove to be useful for correlation studies in series of bioactive peptide analogues.

Journal ArticleDOI
TL;DR: A new reliable method called probabilistic principal component analysis (PPCA) is put forward to impute the missing flow volume data based on historical data mining to reduce the root-mean-square imputation error by at least 25%.
Abstract: The missing data problem greatly affects traffic analysis. In this paper, we put forward a new reliable method called probabilistic principal component analysis (PPCA) to impute the missing flow volume data based on historical data mining. First, we review the current missing data-imputation method and why it may fail to yield acceptable results in many traffic flow applications. Second, we examine the statistical properties of traffic flow volume time series. We show that the fluctuations of traffic flow are Gaussian type and that principal component analysis (PCA) can be used to retrieve the features of traffic flow. Third, we discuss how to use a robust PCA to filter out the abnormal traffic flow data that disturb the imputation process. Finally, we recall the theories of PPCA/Bayesian PCA-based imputation algorithms and compare their performance with some conventional methods, including the nearest/mean historical imputation methods and the local interpolation/regression methods. The experiments prove that the PPCA method provides significantly better performance than the conventional methods, reducing the root-mean-square imputation error by at least 25%.
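A simplified numpy sketch of PCA-based imputation in the spirit of this paper: missing entries are initialized with column means and then refined by alternating between a low-rank PCA fit and reconstruction. This EM-like iteration is not the authors' PPCA/Bayesian PCA formulation, and the synthetic "flow profile" data, rank and iteration count are assumptions.

```python
import numpy as np

def pca_impute(X, rank=3, n_iter=50):
    """Fill NaNs in X by alternating a rank-`rank` PCA fit with reconstruction of missing cells."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])   # initialize with column means
    for _ in range(n_iter):
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        recon = mean + (U[:, :rank] * s[:rank]) @ Vt[:rank]  # low-rank reconstruction
        X[missing] = recon[missing]                          # update only the missing cells
    return X

# synthetic "daily flow profiles": low-rank structure plus noise, 20% of entries missing
rng = np.random.default_rng(0)
true = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 96)) \
       + 0.1 * rng.standard_normal((200, 96))
obs = true.copy()
mask = rng.random(true.shape) < 0.2
obs[mask] = np.nan
imputed = pca_impute(obs, rank=4)
rmse = np.sqrt(np.mean((imputed[mask] - true[mask]) ** 2))
print(f"RMS imputation error on the missing entries: {rmse:.3f}")
```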

Journal ArticleDOI
TL;DR: It is shown that the proposed algorithm outperformed PCA and DPCA both in terms of detection and diagnosis of faults.

Journal ArticleDOI
TL;DR: It is shown that temporal basis functions calculated by subjecting the training data to principal component analysis (PCA) can be used to constrain the reconstruction such that the temporal resolution is improved.
Abstract: The k-t broad-use linear acquisition speed-up technique (BLAST) has become widespread for reducing image acquisition time in dynamic MRI. In its basic form k-t BLAST speeds up the data acquisition by undersampling k-space over time (referred to as k-t space). The resulting aliasing is resolved in the Fourier reciprocal x-f space (x = spatial position, f = temporal frequency) using an adaptive filter derived from a low-resolution estimate of the signal covariance. However, this filtering process tends to increase the reconstruction error or lower the achievable acceleration factor. This is problematic in applications exhibiting a broad range of temporal frequencies such as free-breathing myocardial perfusion imaging. We show that temporal basis functions calculated by subjecting the training data to principal component analysis (PCA) can be used to constrain the reconstruction such that the temporal resolution is improved. The presented method is called k-t PCA.

Journal ArticleDOI
TL;DR: In this paper, principal component analysis and soft independent modeling of class analogy are employed to generate models that predict the rock type of the samples; these models appear to exploit the matrix effects associated with the chemistries of the 18 samples.

Journal ArticleDOI
TL;DR: Experimental results presented in this paper confirm the usefulness of the KPCA for the analysis of hyperspectral data and improve results in terms of accuracy.
Abstract: Kernel principal component analysis (KPCA) is investigated for feature extraction from hyperspectral remote sensing data. Features extracted using KPCA are classified using linear support vector machines. In one experiment, it is shown that kernel principal component features are more linearly separable than features extracted with conventional principal component analysis. In a second experiment, kernel principal components are used to construct the extended morphological profile (EMP). Classification results, in terms of accuracy, are improved in comparison to the original approach, which used conventional principal component analysis to construct the EMP. Experimental results presented in this paper confirm the usefulness of KPCA for the analysis of hyperspectral data. For one data set, the overall classification accuracy increases from 79% to 96% with the proposed approach.
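A small scikit-learn sketch of the first experiment described above: features extracted with kernel PCA versus ordinary PCA, each followed by a linear SVM. A synthetic nonlinear two-class problem stands in for the hyperspectral pixels, and the kernel, gamma and component counts are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# nonlinearly separable stand-in for pixel features with two-class structure
X, y = make_circles(n_samples=600, factor=0.3, noise=0.05, random_state=0)

pipelines = {
    "PCA  + linear SVM": make_pipeline(PCA(n_components=2), LinearSVC()),
    "KPCA + linear SVM": make_pipeline(KernelPCA(n_components=2, kernel="rbf", gamma=10.0),
                                       LinearSVC()),
}
for name, pipe in pipelines.items():
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

On this toy problem the kernel principal components are (nearly) linearly separable while the ordinary principal components are not, which is the behaviour the abstract reports for the hyperspectral data.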

Proceedings ArticleDOI
04 Dec 2009
TL;DR: A high-accuracy human activity recognition system based on a single tri-axis accelerometer for use in a naturalistic environment, exploiting the discrete cosine transform, principal component analysis (PCA), and support vector machines to classify different human activities.
Abstract: This paper develops a high-accuracy human activity recognition system based on a single tri-axis accelerometer for use in a naturalistic environment. The system exploits the discrete cosine transform (DCT), principal component analysis (PCA), and support vector machines (SVM) to classify different human activities. First, effective features are extracted from the accelerometer data using the DCT. Next, the feature dimension is reduced by PCA in the DCT domain; after applying PCA, the most invariant and discriminating information for recognition is retained. Multi-class support vector machines are then adopted to distinguish the different human activities. Experimental results show that the proposed system achieves a best accuracy of 97.51%, which is better than other approaches.
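The sketch below mirrors the processing chain described above on synthetic tri-axial accelerometer windows: DCT of each axis, PCA in the DCT domain, then a multi-class SVM. The synthetic signals, window length, number of retained DCT coefficients and PCA components are all assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def synth_window(activity, n=128):
    """Crude stand-in for one tri-axial accelerometer window of a given activity."""
    t = np.arange(n)
    freq, amp = {"still": (0.0, 0.0), "walk": (0.05, 1.0), "run": (0.12, 2.5)}[activity]
    return np.stack([amp * np.sin(2 * np.pi * freq * t + ph) + 0.3 * rng.standard_normal(n)
                     for ph in (0.0, 1.0, 2.0)])          # x, y, z axes

activities = ["still", "walk", "run"]
X, y = [], []
for label, act in enumerate(activities):
    for _ in range(100):
        w = synth_window(act)
        X.append(dct(w, axis=1, norm="ortho")[:, :32].ravel())   # first 32 DCT coefficients per axis
        y.append(label)
X, y = np.array(X), np.array(y)

# PCA reduces the DCT features, then a multi-class SVM separates the activities
model = make_pipeline(PCA(n_components=10), SVC(kernel="rbf", C=10))
print("mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))
```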

Journal ArticleDOI
TL;DR: In this paper, the authors consider a spiked covariance model in which a base matrix is perturbed by adding a k-sparse maximal eigenvector, and analyze two computationally tractable methods for recovering the support set of this maximal eigenvector: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size θ_dia(n, p, k) = n/[k² log(p−k)]; and (b) a more sophisticated semidefinite programming (SDP) relaxation.
Abstract: Principal component analysis (PCA) is a classical method for dimensionality reduction based on extracting the dominant eigenvectors of the sample covariance matrix. However, PCA is well known to behave poorly in the “large p, small n” setting, in which the problem dimension p is comparable to or larger than the sample size n. This paper studies PCA in this high-dimensional regime, but under the additional assumption that the maximal eigenvector is sparse, say, with at most k nonzero components. We consider a spiked covariance model in which a base matrix is perturbed by adding a k-sparse maximal eigenvector, and we analyze two computationally tractable methods for recovering the support set of this maximal eigenvector, as follows: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size θ_dia(n, p, k) = n/[k² log(p−k)]; and (b) a more sophisticated semidefinite programming (SDP) relaxation, which succeeds once the rescaled sample size θ_sdp(n, p, k) = n/[k log(p−k)] is larger than a critical threshold. In addition, we prove that no method, including the best method which has exponential-time complexity, can succeed in recovering the support if the order parameter θ_sdp(n, p, k) is below a threshold. Our results thus highlight an interesting trade-off between computational and statistical efficiency in high-dimensional inference.
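A small numpy simulation of the simpler of the two methods analysed here, diagonal thresholding: under a spiked covariance model with a k-sparse leading eigenvector, the support is estimated as the k coordinates with the largest sample variances. The SDP relaxation is not shown, and the problem sizes and spike strength are illustrative.

```python
import numpy as np

def diagonal_thresholding_support(X, k):
    """Estimate the support of a k-sparse leading eigenvector from the k largest sample variances."""
    variances = (X - X.mean(axis=0)).var(axis=0)
    return set(np.argsort(variances)[-k:])

# spiked covariance: identity plus beta * z z^T with a k-sparse unit vector z
rng = np.random.default_rng(0)
p, k, beta = 500, 10, 4.0
support = set(range(k))
z = np.zeros(p); z[list(support)] = 1 / np.sqrt(k)

for n in (50, 200, 1000):
    X = rng.standard_normal((n, 1)) * np.sqrt(beta) @ z[None, :] + rng.standard_normal((n, p))
    hit = diagonal_thresholding_support(X, k) == support
    print(f"n = {n:4d}: exact support recovery = {hit}")
```

Re-running this for a grid of (n, p, k) values and plotting success frequency against n/[k² log(p−k)] is one way to see the success-to-failure transition the abstract describes.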

Journal ArticleDOI
TL;DR: A matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit of finite sample PCA are presented.
Abstract: Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of $n$ observations (samples), each with $p$ variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size $n$, and those of the limiting population PCA as $n\to\infty$. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit $p,n\to\infty$, with $p/n=c$. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite $p,n$ where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size $n$, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
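A brief numpy simulation of the finite-sample behaviour discussed above: under a rank-one spiked covariance model with p/n fixed, the overlap between the leading sample and population eigenvectors collapses once the spike strength falls below roughly sqrt(p/n). The closed-form "asymptotic prediction" printed below is the standard spiked-model limit from this literature, quoted for comparison; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 500                     # c = p/n = 1, so the critical spike strength is sqrt(c) = 1
v = np.zeros(p); v[0] = 1.0         # population leading eigenvector

for beta in (0.5, 1.0, 2.0, 4.0):   # spike strengths below, at and above the threshold
    # samples from N(0, I + beta * v v^T)
    X = rng.standard_normal((n, p)) + np.sqrt(beta) * rng.standard_normal((n, 1)) * v[None, :]
    cov = X.T @ X / n
    _, V = np.linalg.eigh(cov)
    overlap = abs(V[:, -1] @ v)     # |<leading sample eigenvector, population eigenvector>|
    predicted = np.sqrt(max(0.0, (1 - (p / n) / beta**2) / (1 + (p / n) / beta)))
    print(f"beta = {beta:3.1f}: sample overlap = {overlap:.2f}, asymptotic prediction = {predicted:.2f}")
```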


Journal ArticleDOI
TL;DR: This work investigates the asymptotic behavior of the Principal Component (PC) directions and shows that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most otherPC directions are strongly inconsistent.
Abstract: Principal Component Analysis (PCA) is an important tool of dimension reduction especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed, and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which includes a $\rho$-mixing condition and a broad range of sphericity measures of the covariance matrix.