
Showing papers on "Principal component analysis" published in 2008


Journal ArticleDOI
Markus Ringnér
TL;DR: Principal component analysis is often incorporated into genome-wide expression studies, but what is it and how can it be used to explore high-dimensional data?
Abstract: Principal component analysis is often incorporated into genome-wide expression studies, but what is it and how can it be used to explore high-dimensional data?

1,538 citations
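
For readers new to the technique, a minimal sketch of how PCA is typically applied to an expression-like matrix (samples in rows, genes in columns); the data and dimensions below are synthetic and purely illustrative, not taken from the paper:

    # Minimal PCA sketch for an expression-like matrix (samples x genes).
    # Each gene (column) is centred before the decomposition.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 1000))            # 20 samples, 1000 genes
    Xc = X - X.mean(axis=0)                    # centre each gene

    # SVD of the centred matrix gives the principal components directly.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                             # sample coordinates on the PCs
    explained = s**2 / np.sum(s**2)            # fraction of variance per PC

    print(scores[:, :2])                       # first two PCs, e.g. for plotting
    print(explained[:3])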


Book
25 Aug 2008
TL;DR: This book provides a course in applied multivariate statistics, moving from matrix algebra, multivariate distributions, the theory of the multinormal, estimation, and hypothesis testing to multivariate techniques including principal components analysis, factor analysis, cluster analysis, and discriminant analysis.
Abstract: I Descriptive Techniques: Comparison of Batches.- II Multivariate Random Variables: A Short Excursion into Matrix Algebra Moving to Higher Dimensions Multivariate Distributions Theory of the Multinormal Theory of Estimation Hypothesis Testing.- III Multivariate Techniques: Decomposition of Data Matrices by Factors Principal Components Analysis Factor Analysis Cluster Analysis Discriminant Analysis.- Correspondence Analysis.- Canonical Correlation Analysis.- Multidimensional Scaling.- Conjoint Measurement Analysis.- Application in Finance.- Computationally Intensive Techniques.- A: Symbols and Notations.- B: Data.- Bibliography.- Index.

1,081 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates.
Abstract: This paper considers regularizing a covariance matrix of p variables estimated from n observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and (log p)/n → 0, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data. 1. Introduction. Estimation of covariance matrices is important in a number of areas of statistical analysis, including dimension reduction by principal component analysis (PCA), classification by linear or quadratic discriminant analysis (LDA and QDA), establishing independence and conditional independence relations in the context of graphical models, and setting confidence intervals on linear functions of the means of the components. In recent years, many application areas where these tools are used have been dealing with very high-dimensional datasets, and sample sizes can be very small relative to dimension. Examples include genetic data, brain imaging, spectroscopic imaging, climate data and many others. It is well known by now that the empirical covariance matrix for samples of size n from a p-variate Gaussian distribution, N_p(μ, Σ_p), is not a good estimator of the population covariance if p is large. Many results in random matrix theory illustrate this, from the classical Marčenko–Pastur law [29] to the more recent work of Johnstone and his students on the theory of the largest eigenvalues [12, 23, 30] and associated eigenvectors [24]. However, with the exception of a method for estimating the covariance spectrum [11], these probabilistic results do not offer alternatives to the sample covariance matrix. Alternative estimators for large covariance matrices have therefore attracted a lot of attention recently. Two broad classes of covariance estimators have emerged: those that rely on a natural ordering among variables, and assume that variables

1,052 citations
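
A minimal sketch of the hard-thresholding estimator discussed above; the constant in the threshold and the choice to leave the diagonal untouched are illustrative conveniences, not the paper's cross-validated selection:

    # Hard thresholding of a sample covariance matrix: entries with
    # |s_ij| below the threshold are set to zero.
    import numpy as np

    def threshold_covariance(X, c=1.0):
        n, p = X.shape
        S = np.cov(X, rowvar=False)            # p x p sample covariance
        t = c * np.sqrt(np.log(p) / n)         # threshold of order sqrt(log p / n)
        S_thr = np.where(np.abs(S) >= t, S, 0.0)
        np.fill_diagonal(S_thr, np.diag(S))    # keep the variances (practical choice)
        return S_thr

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 200))            # n = 100 observations, p = 200 variables
    S_hat = threshold_covariance(X, c=1.0)
    print(np.mean(S_hat != 0))                 # fraction of nonzero entries kept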


Book
01 Jan 2008
TL;DR: In this book, the authors present statistical and multivariate methods for the analysis of paleontological data, including morphometrics based on the allometric equation and principal components analysis, phylogenetic analysis, paleobiogeography and paleoecology, time series analysis, and quantitative biostratigraphy.
Abstract: Preface. Acknowledgments. 1 Introduction. 1.1 The nature of paleontological data. 1.2 Advantages and pitfalls of paleontological data analysis. 1.3 Software. 2 Basic statistical methods. 2.1 Introduction. 2.2 Statistical distributions. 2.3 Shapiro-Wilk test for normal distribution. 2.4 F test for equality of variances. 2.5 Student's t test and Welch test for equality of means. 2.6 Mann-Whitney U test for equality of medians. 2.7 Kolmogorov-Smirnov test for equality of distributions. 2.8 Permutation and resampling. 2.9 One-way ANOVA. 2.10 Kruskal-Wallis test. 2.11 Linear correlation. 2.12 Non-parametric tests for correlation. 2.13 Linear regression. 2.14 Reduced major axis regression. 2.15 Nonlinear curve fitting. 2.16 Chi-square test. 3 Introduction to multivariate data analysis. 3.1 Approaches to multivariate data analysis. 3.2 Multivariate distributions. 3.3 Parametric multivariate tests. 3.4 Non-parametric multivariate tests. 3.5 Hierarchical cluster analysis. 3.6 K-means cluster analysis. 4 Morphometrics. 4.1 Introduction. 4.2 The allometric equation. 4.3 Principal components analysis (PCA). 4.4 Multivariate allometry. 4.5 Discriminant analysis for two groups. 4.6 Canonical variate analysis (CVA). 4.7 MANOVA. 4.8 Fourier shape analysis. 4.9 Elliptic Fourier analysis. 4.10 Eigenshape analysis. 4.11 Landmarks and size measures. 4.12 Procrustean fitting. 4.13 PCA of landmark data. 4.14 Thin-plate spline deformations. 4.15 Principal and partial warps. 4.16 Relative warps. 4.17 Regression of partial warp scores. 4.18 Disparity measures. 4.19 Point distribution statistics. 4.20 Directional statistics. Case study: The ontogeny of a Silurian trilobite. 5 Phylogenetic analysis. 5.1 Introduction. 5.2 Characters. 5.3 Parsimony analysis. 5.4 Character state reconstruction. 5.5 Evaluation of characters and tree topologies. 5.6 Consensus trees. 5.7 Consistency index. 5.8 Retention index. 5.9 Bootstrapping. 5.10 Bremer support. 5.11 Stratigraphical congruency indices. 5.12 Phylogenetic analysis with Maximum Likelihood. Case study: The systematics of heterosporous ferns. 6 Paleobiogeography and paleoecology. 6.1 Introduction. 6.2 Diversity indices. 6.3 Taxonomic distinctness. 6.4 Comparison of diversity indices. 6.5 Abundance models. 6.6 Rarefaction. 6.7 Diversity curves. 6.8 Size-frequency and survivorship curves. 6.9 Association similarity indices for presence/absence data. 6.10 Association similarity indices for abundance data. 6.11 ANOSIM and NPMANOVA. 6.12 Correspondence analysis. 6.13 Principal Coordinates analysis (PCO). 6.14 Non-metric Multidimensional Scaling (NMDS). 6.15 Seriation. Case study: Ashgill brachiopod paleocommunities from East China. 7 Time series analysis. 7.1 Introduction. 7.2 Spectral analysis. 7.3 Autocorrelation. 7.4 Cross-correlation. 7.5 Wavelet analysis. 7.6 Smoothing and filtering. 7.7 Runs test. Case study: Sepkoski's generic diversity curve for the Phanerozoic. 8 Quantitative biostratigraphy. 8.1 Introduction. 8.2 Parametric confidence intervals on stratigraphic ranges. 8.3 Non-parametric confidence intervals on stratigraphic ranges. 8.4 Graphic correlation. 8.5 Constrained optimisation. 8.6 Ranking and scaling. 8.7 Unitary Associations. 8.8 Biostratigraphy by ordination. 8.9 What is the best method for quantitative biostratigraphy? Appendix A: Plotting techniques. Appendix B: Mathematical concepts and notation. References. Index

867 citations


Journal ArticleDOI
TL;DR: It is shown that even without a fully optimized design, an MPCA-based gait recognition module achieves highly competitive performance and compares favorably to the state-of-the-art gait recognizers.
Abstract: This paper introduces a multilinear principal component analysis (MPCA) framework for tensor object feature extraction. Objects of interest in many computer vision and pattern recognition applications, such as 2D/3D images and video sequences, are naturally described as tensors or multilinear arrays. The proposed framework performs feature extraction by determining a multilinear projection that captures most of the original tensorial input variation. The solution is iterative in nature and proceeds by decomposing the original problem into a series of multiple projection subproblems. As part of this work, methods for subspace dimensionality determination are proposed and analyzed. It is shown that the MPCA framework discussed in this work supplants existing heterogeneous solutions such as the classical principal component analysis (PCA) and its 2D variant (2D PCA). Finally, a tensor object recognition system is proposed with the introduction of a discriminative tensor feature selection mechanism and a novel classification strategy, and applied to the problem of gait recognition. Results presented here indicate MPCA's utility as a feature extraction tool. It is shown that even without a fully optimized design, an MPCA-based gait recognition module achieves highly competitive performance and compares favorably to the state-of-the-art gait recognizers.

856 citations
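
A hedged two-mode sketch of the alternating-projection idea for matrix-valued samples; the dimensions, initialization, and fixed iteration count are illustrative, and this is not the authors' full MPCA implementation:

    # Alternating estimation of row/column projection matrices U1, U2 that
    # capture most of the variation of a set of centred 2-D samples,
    # in the spirit of multilinear PCA restricted to two modes.
    import numpy as np

    def mpca_2d(samples, p1, p2, n_iter=10):
        X = samples - samples.mean(axis=0)     # centre the tensor samples
        M, I1, I2 = X.shape
        U1 = np.eye(I1)[:, :p1]                # start from truncated identities
        U2 = np.eye(I2)[:, :p2]
        for _ in range(n_iter):
            # Mode-1: covariance of samples projected along mode 2.
            C1 = sum((x @ U2) @ (x @ U2).T for x in X)
            U1 = np.linalg.eigh(C1)[1][:, ::-1][:, :p1]
            # Mode-2: covariance of samples projected along mode 1.
            C2 = sum((x.T @ U1) @ (x.T @ U1).T for x in X)
            U2 = np.linalg.eigh(C2)[1][:, ::-1][:, :p2]
        features = np.stack([U1.T @ x @ U2 for x in X])   # M x p1 x p2 feature tensors
        return U1, U2, features

    rng = np.random.default_rng(2)
    imgs = rng.normal(size=(50, 32, 24))       # 50 samples of 32x24 "images"
    U1, U2, feats = mpca_2d(imgs, p1=5, p2=4)
    print(feats.shape)                         # (50, 5, 4)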


Journal ArticleDOI
TL;DR: Sparse PCA via regularized SVD (sPCA-rSVD) provides a uniform treatment of both classical multivariate data and high-dimension-low-sample-size (HDLSS) data, and comparative studies suggest that sPCA-rSVD provides competitive results.

730 citations


Journal ArticleDOI
Nojun Kwak
TL;DR: A method of principal component analysis (PCA) based on a new L1-norm optimization technique which is robust to outliers and invariant to rotations and also proven to find a locally maximal solution.
Abstract: A method of principal component analysis (PCA) based on a new L1-norm optimization technique is proposed. Unlike conventional PCA which is based on L2-norm, the proposed method is robust to outliers because it utilizes L1-norm which is less sensitive to outliers. It is invariant to rotations as well. The proposed L1-norm optimization technique is intuitive, simple, and easy to implement. It is also proven to find a locally maximal solution. The proposed method is applied to several datasets and the performances are compared with those of other conventional methods.

715 citations
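
A minimal sketch of a fixed-point update for a single L1-norm component, consistent with the maximization described in the abstract; the initialization, data, and comparison are illustrative, and further components would be obtained by deflation:

    # One L1-norm principal direction: maximize sum_i |w^T x_i| over unit w
    # via the fixed-point update w <- normalize(sum_i sign(w^T x_i) x_i).
    import numpy as np

    def pca_l1_first_component(X, n_iter=200):
        Xc = X - X.mean(axis=0)
        w = np.ones(Xc.shape[1]) / np.sqrt(Xc.shape[1])   # illustrative start
        for _ in range(n_iter):
            s = np.sign(Xc @ w)
            s[s == 0] = 1.0                    # avoid zero signs
            w_new = Xc.T @ s
            w_new /= np.linalg.norm(w_new)
            if np.allclose(w_new, w):
                break
            w = w_new
        return w

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    X[:, 0] *= 5.0                             # dominant direction along axis 0
    X[:3, 1] += 100.0                          # a few gross outliers along axis 1

    w_l1 = pca_l1_first_component(X)
    w_l2 = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]
    print(np.round(w_l1, 2))                   # intended to stay near axis 0
    print(np.round(w_l2, 2))                   # ordinary L2 PCA is pulled toward the outliers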


Journal ArticleDOI
TL;DR: It is found that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events.
Abstract: Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events. These interpretations have been controversial, but influential, and the use of PCA has become widespread in analysis of population genetics data. However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events. Our findings aid interpretation of PCA results and suggest how PCA can help correct for continuous population structure in association studies.

596 citations
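
A small simulation in the spirit of the paper's point: PCA applied to spatially autocorrelated data sampled along a one-dimensional transect yields smooth, sinusoid-like component maps even though no migration event is encoded (all parameter values below are illustrative):

    # Samples sit on a 1-D transect; covariance between locations decays
    # exponentially with distance. The leading eigenvectors of such a
    # covariance are smooth, wave-like functions of position.
    import numpy as np

    n_loc = 200
    pos = np.linspace(0.0, 1.0, n_loc)
    cov = np.exp(-np.abs(pos[:, None] - pos[None, :]) / 0.1)   # spatial covariance

    rng = np.random.default_rng(4)
    X = rng.multivariate_normal(np.zeros(n_loc), cov, size=500)  # 500 independent draws
    Xc = X - X.mean(axis=0)

    # Eigenvectors of the sample covariance across locations = PC "maps".
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    pc1, pc2 = evecs[:, -1], evecs[:, -2]
    # pc1 varies smoothly without a sign change across the transect (a gradient),
    # pc2 changes sign once (a wave) - patterns arising from spatial structure alone.
    print(np.round(pc1[::40], 3))
    print(np.round(pc2[::40], 3))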


Journal ArticleDOI
TL;DR: The evaluation of the pan-sharpened images using global validation indexes reveals that the adaptive PCA approach helps reduce the spectral distortion, and its merger with contourlets provides better fusion results.
Abstract: High correlation among the neighboring pixels both spatially and spectrally in a multispectral image makes it necessary to use an efficient data transformation approach before performing pan-sharpening. Wavelets and principal component analysis (PCA) methods have been a popular choice for spatial and spectral transformations, respectively. Current PCA-based pan-sharpening methods make an assumption that the first principal component (PC) of high variance is an ideal choice for replacing or injecting it with high spatial details from the high-resolution histogram-matched panchromatic (PAN) image. This paper presents a combined adaptive PCA-contourlet approach for pan-sharpening, where the adaptive PCA is used to reduce the spectral distortion and the use of nonsubsampled contourlets for spatial transformation in pan-sharpening is incorporated to overcome the limitation of the wavelets in representing the directional information efficiently and capturing intrinsic geometrical structures of the objects. The efficiency of the presented method is tested by performing pan-sharpening of the high-resolution (IKONOS and QuickBird) and the medium-resolution (Landsat-7 Enhanced Thematic Mapper Plus) datasets. The evaluation of the pan-sharpened images using global validation indexes reveals that the adaptive PCA approach helps reduce the spectral distortion, and its merger with contourlets provides better fusion results.

587 citations
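
For context, a minimal sketch of the classical PC-substitution step that the adaptive approach improves on: PCA-transform the multispectral bands, match the PAN band to the first component, substitute, and invert. The arrays and the matching rule are illustrative; the adaptive PCA and contourlet stages of the paper are not shown:

    # Classical PCA pan-sharpening: replace PC1 of the (upsampled)
    # multispectral image with a histogram-matched panchromatic band.
    import numpy as np

    def pca_pansharpen(ms, pan):
        # ms: (H, W, B) multispectral image, already resampled to PAN resolution
        # pan: (H, W) high-resolution panchromatic band
        h, w, b = ms.shape
        Xms = ms.reshape(-1, b).astype(float)
        mean = Xms.mean(axis=0)
        Xc = Xms - mean
        evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        evecs = evecs[:, ::-1]                     # descending variance
        pcs = Xc @ evecs                           # pixel scores on the PCs
        p = pan.reshape(-1).astype(float)
        # Match the PAN band's mean and spread to PC1 before substitution.
        p_matched = (p - p.mean()) / p.std() * pcs[:, 0].std() + pcs[:, 0].mean()
        pcs[:, 0] = p_matched
        sharpened = pcs @ evecs.T + mean           # inverse PCA transform
        return sharpened.reshape(h, w, b)

    rng = np.random.default_rng(5)
    ms = rng.random((64, 64, 4))
    pan = ms.mean(axis=2) + 0.01 * rng.random((64, 64))
    print(pca_pansharpen(ms, pan).shape)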


Journal ArticleDOI
30 Apr 2008-Heredity
TL;DR: This paper proposes a new spatially explicit multivariate method, spatial principal component analysis (sPCA), to investigate the spatial pattern of genetic variability using allelic frequency data of individuals or populations, and shows that sPCA performed better than PCA to reveal spatial genetic patterns.
Abstract: Increasing attention is being devoted to taking landscape information into account in genetic studies. Among landscape variables, space is often considered as one of the most important. To reveal spatial patterns, a statistical method should be spatially explicit, that is, it should directly take spatial information into account as a component of the adjusted model or of the optimized criterion. In this paper we propose a new spatially explicit multivariate method, spatial principal component analysis (sPCA), to investigate the spatial pattern of genetic variability using allelic frequency data of individuals or populations. This analysis does not require data to meet Hardy–Weinberg expectations or linkage equilibrium to exist between loci. The sPCA yields scores summarizing both the genetic variability and the spatial structure among individuals (or populations). Global structures (patches, clines and intermediates) are disentangled from local ones (strong genetic differences between neighbors) and from random noise. Two statistical tests are proposed to detect the existence of both types of patterns. As an illustration, the results of principal component analysis (PCA) and sPCA are compared using simulated datasets and real georeferenced microsatellite data of Scandinavian brown bear individuals (Ursus arctos). sPCA performed better than PCA to reveal spatial genetic patterns. The proposed methodology is implemented in the adegenet package of the free software R.

543 citations
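
A hedged numerical sketch of the criterion as described in the abstract: each sPCA axis trades off the variance of a score against its spatial autocorrelation (Moran's I), which with a row-standardized connectivity matrix W reduces to an eigendecomposition of a symmetrized quadratic form. This is one reading of the method for illustration, not the adegenet implementation, and the connectivity rule below is an assumption:

    # Scores with large positive eigenvalues indicate global structure
    # (positive spatial autocorrelation); large negative ones indicate
    # local structure (neighbors that differ strongly).
    import numpy as np

    def spca_sketch(X, coords, k_neighbors=5):
        Xc = X - X.mean(axis=0)                      # centred allele frequencies
        n = Xc.shape[0]
        # Binary k-nearest-neighbour connectivity, row-standardized.
        d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
        W = np.zeros((n, n))
        for i in range(n):
            W[i, np.argsort(d[i])[1:k_neighbors + 1]] = 1.0
        W /= W.sum(axis=1, keepdims=True)
        # Eigenvectors of this matrix maximize var(Xa) * Moran's I(Xa).
        H = Xc.T @ (W + W.T) @ Xc / (2.0 * n)
        evals, evecs = np.linalg.eigh(H)
        return evals, Xc @ evecs                     # eigenvalues and sample scores

    rng = np.random.default_rng(6)
    coords = rng.random((80, 2))                     # 80 georeferenced individuals
    X = rng.random((80, 30))                         # 30 allele frequencies (synthetic)
    evals, scores = spca_sketch(X, coords)
    print(evals[-1], evals[0])                       # most global / most local axes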


01 Jan 2008
TL;DR: This book presents graphical and statistical methods for analysing environmental and geochemical data, including exploratory data analysis, geochemical mapping, the identification of extreme values and data outliers, multivariate methods such as principal component analysis, and quality control, and points out common mistakes in geochemical mapping.
Abstract: Preface. Acknowledgements. About the Authors. 1. Introduction. 1.1 The Kola Ecogeochemistry Project. 2. Preparing the Data for Use in R and DAS+R. 2.1 Required data format for import into R and DAS+R. 2.2 The detection limit problem. 2.3 Missing Values. 2.4 Some "typical" problems encountered when editing a laboratory data report file to a DAS+R file. 2.5 Appending and linking data files. 2.6 Requirements for a geochemical database. 2.7 Summary. 3. Graphics to Display the Data Distribution. 3.1 The one-dimensional scatterplot. 3.2 The histogram. 3.3 The density trace. 3.4 Plots of the distribution function. 3.5 Boxplots. 3.6 Combination of histogram, density trace, one-dimensional scatterplot, boxplot, and ECDF-plot. 3.7 Combination of histogram, boxplot or box-and-whisker plot, ECDF-plot, and CP-plot. 3.8 Summary. 4. Statistical Distribution Measures. 4.1 Central value. 4.2 Measures of spread. 4.3 Quartiles, quantiles and percentiles. 4.4 Skewness. 4.5 Kurtosis. 4.6 Summary table of statistical distribution measures. 4.7 Summary. 5. Mapping Spatial Data. 5.1 Map coordinate systems (map projection). 5.2 Map scale. 5.3 Choice of the base map for geochemical mapping. 5.4 Mapping geochemical data with proportional dots. 5.5 Mapping geochemical data using classes. 5.6 Surface maps constructed with smoothing techniques. 5.7 Surface maps constructed with kriging. 5.8 Colour maps. 5.9 Some common mistakes in geochemical mapping. 5.10 Summary. 6. Further Graphics for Exploratory Data Analysis. 6.1 Scatterplots (xy-plots). 6.2 Linear regression lines. 6.3 Time trends. 6.4 Spatial trends. 6.5 Spatial distance plot. 6.6 Spiderplots (normalized multi-element diagrams). 6.7 Scatterplot matrix. 6.8 Ternary plots. 6.9 Summary. 7. Defining Background and Threshold, Identification of Data Outliers and Element Sources. 7.1 Statistical methods to identify extreme values and data outliers. 7.2 Detecting outliers and extreme values in the ECDF- or CP-plot. 7.3 Including the spatial distribution in the definition of background. 7.4 Methods to distinguish geogenic from anthropogenic element sources. 7.5 Summary. 8. Comparing Data in Tables and Graphics. 8.1 Comparing data in tables. 8.2 Graphical comparison of the data distributions of several data sets. 8.3 Comparing the spatial data structure. 8.4 Subset creation - a mighty tool in graphical data analysis. 8.5 Data subsets in scatterplots. 8.6 Data subsets in time and spatial trend diagrams. 8.7 Data subsets in ternary plots. 8.8 Data subsets in the scatterplot matrix. 8.9 Data subsets in maps. 8.10 Summary. 9. Comparing Data Using Statistical Tests. 9.1 Tests for distribution (Kolmogorov-Smirnov and Shapiro-Wilk tests). 9.2 The one-sample t-test (test for the central value). 9.3 Wilcoxon signed-rank test. 9.4 Comparing two central values of the distributions of independent data groups. 9.5 Comparing two central values of matched pairs of data. 9.6 Comparing the variance of two data groups. 9.7 Comparing several central values. 9.8 Comparing the variance of several data groups. 9.9 Comparing several central values of dependent groups. 9.10 Summary. 10. Improving Data Behaviour for Statistical Analysis: Ranking and Transformations. 10.1 Ranking/sorting. 10.2 Non-linear transformations. 10.3 Linear transformations. 10.4 Preparing a data set for multivariate data analysis. 10.5 Transformations for closed number systems. 10.6 Summary. 11. Correlation. 11.1 Pearson correlation. 11.2 Spearman rank correlation. 11.3 Kendall-tau correlation.
11.4 Robust correlation coefficients. 11.5 When is a correlation coefficient significant? 11.6 Working with many variables. 11.7 Correlation analysis and inhomogeneous data. 11.8 Correlation results following additive logratio or central logratio transformations. 11.9 Summary. 12. Multivariate Graphics. 12.1 Profiles. 12.2 Stars. 12.3 Segments. 12.4 Boxes. 12.5 Castles and trees. 12.6 Parallel coordinates plot. 12.7 Summary. 13. Multivariate Outlier Detection. 13.1 Univariate versus multivariate outlier detection. 13.2 Robust versus non-robust outlier detection. 13.3 The chi-square plot. 13.4 Automated multivariate outlier detection and visualization. 13.5 Other graphical approaches for identifying outliers and groups. 13.6 Summary. 14. Principal Component Analysis (PCA) and Factor Analysis (FA). 14.1 Conditioning the data for PCA and FA. 14.2 Principal component analysis (PCA). 14.3 Factor Analysis. 14.4 Summary. 15. Cluster Analysis. 15.1 Possible data problems in the context of cluster analysis. 15.2 Distance measures. 15.3 Clustering samples. 15.4 Clustering variables. 15.5 Evaluation of cluster validity. 15.6 Selection of variables for cluster analysis. 15.7 Summary. 16. Regression Analysis (RA). 16.1 Data requirements for regression analysis. 16.2 Multiple regression. 16.3 Classical least squares (LS) regression. 16.4 Robust regression. 16.5 Model selection in regression analysis. 16.6 Other regression methods. 16.7 Summary. 17. Discriminant Analysis (DA) and Other Knowledge-Based Classification Methods. 17.1 Methods for discriminant analysis. 17.2 Data requirements for discriminant analysis. 17.3 Visualisation of the discriminant function. 17.4 Prediction with discriminant analysis. 17.5 Exploring for similar data structures. 17.6 Other knowledge-based classification methods. 17.7 Summary. 18. Quality Control (QC). 18.1 Randomised samples. 18.2 Trueness. 18.3 Accuracy. 18.4 Precision. 18.5 Analysis of variance (ANOVA). 18.6 Using Maps to assess data quality. 18.7 Variables analysed by two different analytical techniques. 18.8 Working with censored data - a practical example. 18.9 Summary. 19. Introduction to R and Structure of the DAS+R Graphical User Interface. 19.1 R. 19.2 R-scripts. 19.3 A brief overview of relevant R commands. 19.4 DAS+R. 19.5 Summary. References. Index.

Posted Content
TL;DR: A new approach to sparse principal component analysis (sparse PCA) is developed, with two single-unit and two block optimization formulations aimed at extracting a single sparse dominant principal component of a data matrix or several such components at once, respectively.
Abstract: In this paper we develop a new approach to sparse principal component analysis (sparse PCA). We propose two single-unit and two block optimization formulations of the sparse PCA problem, aimed at extracting a single sparse dominant principal component of a data matrix, or more components at once, respectively. While the initial formulations involve nonconvex functions, and are therefore computationally intractable, we rewrite them into the form of an optimization program involving maximization of a convex function on a compact set. The dimension of the search space is decreased enormously if the data matrix has many more columns (variables) than rows. We then propose and analyze a simple gradient method suited for the task. It appears that our algorithm has best convergence properties in the case when either the objective function or the feasible set are strongly convex, which is the case with our single-unit formulations and can be enforced in the block case. Finally, we demonstrate numerically on a set of random and gene expression test problems that our approach outperforms existing algorithms both in quality of the obtained solution and in computational speed.
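
As a stand-in illustrating the single-unit task (extract one sparse dominant component), the sketch below uses a simple truncated power iteration that keeps only the k largest loadings at each step; it is not the authors' formulations or gradient method:

    # Truncated power method: power iteration on the covariance matrix
    # with hard truncation to k nonzero loadings after every step.
    import numpy as np

    def sparse_pc_truncated_power(S, k, n_iter=200):
        p = S.shape[0]
        z = np.zeros(p)
        z[np.argmax(np.diag(S))] = 1.0             # start at the highest-variance variable
        for _ in range(n_iter):
            z = S @ z
            keep = np.argsort(np.abs(z))[-k:]      # keep the k largest loadings
            mask = np.zeros(p, dtype=bool)
            mask[keep] = True
            z[~mask] = 0.0
            z /= np.linalg.norm(z)
        return z

    rng = np.random.default_rng(7)
    A = rng.normal(size=(100, 50))
    A[:, :5] += 3.0 * rng.normal(size=(100, 1))    # 5 variables share a strong factor
    S = np.cov(A, rowvar=False)
    z = sparse_pc_truncated_power(S, k=5)
    print(np.nonzero(z)[0])                        # indices of the selected variables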

Journal ArticleDOI
TL;DR: The two-stage classifier is integrated with the mixed-band wavelet-chaos methodology, developed earlier by the authors, for accurate and robust classification of electroencephalograms (EEGs) into healthy, ictal, and interictal EEGs.
Abstract: A novel principal component analysis (PCA)-enhanced cosine radial basis function neural network classifier is presented. The two-stage classifier is integrated with the mixed-band wavelet-chaos methodology, developed earlier by the authors, for accurate and robust classification of electroencephalograms (EEGs) into healthy, ictal, and interictal EEGs. A nine-parameter mixed-band feature space discovered in previous research for effective EEG representation is used as input to the two-stage classifier. In the first stage, PCA is employed for feature enhancement. The rearrangement of the input space along the principal components of the data improves the classification accuracy of the cosine radial basis function neural network (RBFNN) employed in the second stage significantly. The classification accuracy and robustness of the classifier are validated by extensive parametric and sensitivity analysis. The new wavelet-chaos-neural network methodology yields high EEG classification accuracy (96.6%) and is quite robust to changes in training data with a low standard deviation of 1.4%. For epilepsy diagnosis, when only normal and interictal EEGs are considered, the classification accuracy of the proposed model is 99.3%. This statistic is especially remarkable because even the most highly trained neurologists do not appear to be able to detect interictal EEGs more than 80% of the time.
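
A hedged sketch of the two-stage idea, PCA for feature enhancement followed by a radial-basis classifier; scikit-learn's RBF-kernel SVM is used here purely as a stand-in for the paper's cosine RBF neural network, and the nine-dimensional feature matrix is synthetic:

    # Stage 1: PCA re-expresses the mixed-band features along their principal
    # axes (a rotation, no reduction); Stage 2: a radial-basis classifier.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(8)
    X = rng.normal(size=(300, 9))                  # 9 mixed-band features per EEG segment
    y = rng.integers(0, 3, size=300)               # healthy / ictal / interictal labels
    X[y == 1] += 1.5                               # crude synthetic class separation

    clf = make_pipeline(PCA(n_components=9), SVC(kernel="rbf", gamma="scale"))
    print(cross_val_score(clf, X, y, cv=5).mean())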

Journal ArticleDOI
TL;DR: In this paper, a new semidefinite relaxation is proposed to solve the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination.
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of nonzero coefficients, with total complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases.
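
A hedged sketch of the greedy idea described in the abstract: grow the support one variable at a time, adding whichever variable most increases the largest eigenvalue of the corresponding principal submatrix. The data are synthetic, and the semidefinite relaxation and optimality certificates are not shown:

    # Greedy forward selection of a sparse principal component's support.
    import numpy as np

    def greedy_sparse_pca_support(S, k):
        p = S.shape[0]
        support = [int(np.argmax(np.diag(S)))]     # best single variable
        while len(support) < k:
            best_gain, best_j = -np.inf, None
            for j in range(p):
                if j in support:
                    continue
                idx = support + [j]
                lam = np.linalg.eigvalsh(S[np.ix_(idx, idx)])[-1]
                if lam > best_gain:
                    best_gain, best_j = lam, j
            support.append(best_j)
        return sorted(support)

    rng = np.random.default_rng(9)
    A = rng.normal(size=(200, 30))
    A[:, [2, 7, 11]] += 2.0 * rng.normal(size=(200, 1))  # correlated trio of variables
    S = np.cov(A, rowvar=False)
    print(greedy_sparse_pca_support(S, k=3))       # expected to recover {2, 7, 11}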

Posted Content
TL;DR: This work describes an efficient algorithm for the low-rank approximation of matrices that produces accuracy that is very close to the best possible accuracy, for matrices of arbitrary sizes.
Abstract: Principal component analysis (PCA) requires the computation of a low-rank approximation to a matrix containing the data being analyzed. In many applications of PCA, the best possible accuracy of any rank-deficient approximation is at most a few digits (measured in the spectral norm, relative to the spectral norm of the matrix being approximated). In such circumstances, efficient algorithms have not come with guarantees of good accuracy, unless one or both dimensions of the matrix being approximated are small. We describe an efficient algorithm for the low-rank approximation of matrices that produces accuracy very close to the best possible, for matrices of arbitrary sizes. We illustrate our theoretical results via several numerical examples.
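
For illustration, a standard randomized low-rank approximation of the kind this literature studies (Gaussian test matrix, a few power iterations, SVD of the projected matrix); the oversampling and iteration counts are illustrative and this is not necessarily the paper's exact scheme:

    # Randomized rank-k approximation of a matrix A.
    import numpy as np

    def randomized_low_rank(A, k, oversample=10, n_power=2, seed=0):
        rng = np.random.default_rng(seed)
        m, n = A.shape
        Omega = rng.normal(size=(n, k + oversample))       # random test matrix
        Y = A @ Omega
        for _ in range(n_power):                           # power iterations sharpen accuracy
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)                             # orthonormal basis for the range
        B = Q.T @ A                                        # small projected matrix
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return Q @ Ub[:, :k], s[:k], Vt[:k]

    rng = np.random.default_rng(10)
    A = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 500))  # exactly rank 50
    U, s, Vt = randomized_low_rank(A, k=20)
    err = np.linalg.norm(A - (U * s) @ Vt, 2) / np.linalg.norm(A, 2)
    print(err)                                             # relative spectral-norm error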

Journal ArticleDOI
TL;DR: In this paper, the most commonly used generic PCA cross-validation schemes are reviewed, and how well they work in various scenarios is assessed.
Abstract: In regression, cross-validation is an effective and popular approach that is used to decide, for example, the number of underlying features, and to estimate the average prediction error. The basic principle of cross-validation is to leave out part of the data, build a model, and then predict the left-out samples. While such an approach can also be envisioned for component models such as principal component analysis (PCA), most current implementations do not comply with the essential requirement that the predictions should be independent of the entity being predicted. Further, these methods have not been properly reviewed in the literature. In this paper, we review the most commonly used generic PCA cross-validation schemes and assess how well they work in various scenarios.
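
A hedged sketch of one element-wise cross-validation scheme of the general type reviewed in the paper: hold out scattered matrix entries, refit the PCA by iterative imputation so the held-out values cannot influence the model, and accumulate the squared prediction error (PRESS). The imputation rule, hold-out fraction, and iteration counts are illustrative:

    # Element-wise PCA cross-validation via EM-style imputation.
    import numpy as np

    def pca_reconstruct(X, r):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        return mu + (U[:, :r] * s[:r]) @ Vt[:r]

    def press_for_rank(X, r, frac=0.1, n_em=50, seed=0):
        rng = np.random.default_rng(seed)
        mask = rng.random(X.shape) < frac            # entries held out
        Xw = X.copy()
        Xw[mask] = np.nan
        col_means = np.nanmean(Xw, axis=0)
        Xw[mask] = np.take(col_means, np.where(mask)[1])   # initial fill
        for _ in range(n_em):                        # refit the model, re-impute
            Xw[mask] = pca_reconstruct(Xw, r)[mask]
        return np.sum((X[mask] - Xw[mask]) ** 2)     # prediction error on held-out entries

    rng = np.random.default_rng(11)
    X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 15)) + 0.1 * rng.normal(size=(60, 15))
    for r in range(1, 7):
        print(r, round(press_for_rank(X, r), 2))     # PRESS typically bottoms out near rank 4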

Journal ArticleDOI
TL;DR: In this article, a matrix perturbation approach was used to study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n and those of the limiting population PCA as n → oo.
Abstract: Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, and those of the limiting population PCA as n → oo. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n → ∞, with p/n = c. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size n, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
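
A small simulation in the spirit of the spiked-covariance analysis: with population covariance I + λ v v^T and aspect ratio p/n, the sample leading eigenvector tracks v only when the spike strength λ exceeds roughly sqrt(p/n); the grid of spike strengths below is illustrative:

    # Overlap |<v_hat, v>| between sample and population leading eigenvectors
    # as the spike strength crosses the detection threshold sqrt(p/n).
    import numpy as np

    def leading_eig_overlap(n, p, spike, seed=0):
        rng = np.random.default_rng(seed)
        v = np.zeros(p)
        v[0] = 1.0                                    # population eigenvector
        # Draw samples from N(0, I + spike * v v^T).
        X = rng.normal(size=(n, p))
        X[:, 0] *= np.sqrt(1.0 + spike)
        S = np.cov(X, rowvar=False)
        v_hat = np.linalg.eigh(S)[1][:, -1]
        return abs(v_hat @ v)

    n, p = 200, 400                                   # p/n = 2, threshold near sqrt(2)
    for spike in [0.5, 1.0, np.sqrt(p / n), 2.0, 4.0, 8.0]:
        print(round(spike, 2), round(leading_eig_overlap(n, p, spike), 3))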

Journal ArticleDOI
TL;DR: The theoretical analysis of the effects of PCA on the discrimination power of the projected subspace is presented from a general pattern classification perspective for two possible scenarios: when PCA is used as a simple dimensionality reduction tool and when it is used to recondition an ill-posed LDA formulation.
Abstract: Dimensionality reduction is a necessity in most hyperspectral imaging applications. Tradeoffs exist between unsupervised statistical methods, which are typically based on principal components analysis (PCA), and supervised ones, which are often based on Fisher's linear discriminant analysis (LDA), and proponents for each approach exist in the remote sensing community. Recently, a combined approach known as subspace LDA has been proposed, where PCA is employed to recondition ill-posed LDA formulations. The key idea behind this approach is to use a PCA transformation as a preprocessor to discard the null space of rank-deficient scatter matrices, so that LDA can be applied on this reconditioned space. Thus, in theory, the subspace LDA technique benefits from the advantages of both methods. In this letter, we present a theoretical analysis of the effects (often ill effects) of PCA on the discrimination power of the projected subspace. The theoretical analysis is presented from a general pattern classification perspective for two possible scenarios: (1) when PCA is used as a simple dimensionality reduction tool and (2) when it is used to recondition an ill-posed LDA formulation. We also provide experimental evidence of the ineffectiveness of both scenarios for hyperspectral target recognition applications.
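
A minimal sketch of the subspace LDA pipeline analyzed in the letter, with scikit-learn as a stand-in: PCA first discards the (near) null space of the scatter matrices, then LDA is fitted in the reduced space. The synthetic hyperspectral-like data and the number of retained components are illustrative:

    # PCA preprocessing followed by LDA ("subspace LDA").
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(12)
    n_per_class, n_bands = 40, 200                 # few samples, many spectral bands
    X = rng.normal(size=(3 * n_per_class, n_bands))
    y = np.repeat([0, 1, 2], n_per_class)
    X[y == 1, :10] += 0.8                          # weak class-specific signal
    X[y == 2, 10:20] += 0.8

    subspace_lda = make_pipeline(PCA(n_components=30), LinearDiscriminantAnalysis())
    print(cross_val_score(subspace_lda, X, y, cv=5).mean())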

Journal ArticleDOI
TL;DR: Evidence is found that the observed geographic gradients, traditionally thought to represent major historical migrations, may in fact have other interpretations.
Abstract: Principal component analysis (PCA) has been a useful tool for analysis of genetic data, particularly in studies of human migration. A new study finds evidence that the observed geographic gradients, traditionally thought to represent major historical migrations, may in fact have other interpretations.

Book
15 Sep 2008
TL;DR: In this book, the authors present methods for multivariate data analysis, including principal components analysis, exploratory factor analysis, multidimensional scaling and correspondence analysis, cluster analysis, discriminant function analysis, multiple regression and canonical correlation, and the analysis of repeated measures data.
Abstract: Multivariate Data and Multivariate Analysis.- Looking at Multivariate Data.- Principal Components Analysis.- Exploratory Factor Analysis.- Multidimensional Scaling and Correspondence Analysis.- Cluster Analysis.- Grouped Multivariate Data: Multivariate Analysis of Variance and Discriminant Function Analysis.- Multiple Regression and Canonical Correlation.- Analysis of Repeated Measures Data.

01 Jan 2008
TL;DR: The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.
Abstract: Dimensionality reduction and feature subset selection are two techniques for reducing the attribute space of a feature set, which is an important component of both supervised and unsupervised classification or regression problems. While in feature subset selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In this paper we investigate the relationship between several attribute space reduction techniques and the resulting classification accuracy for two very different application areas. On the one hand, we consider e-mail filtering, where the feature space contains various properties of e-mail messages, and on the other hand, we consider drug discovery problems, where quantitative representations of molecular structures are encoded in terms of information-preserving descriptor values. Subsets of the original attributes constructed by filter and wrapper techniques as well as subsets of linear combinations of the original attributes constructed by three different variants of principal component analysis (PCA) are compared in terms of the classification performance achieved with various machine learning algorithms as well as in terms of runtime performance. We successively reduce the size of the attribute sets and investigate the changes in the classification results. Moreover, we explore the relationship between the variance captured in the linear combinations within PCA and the resulting classification accuracy. The results show that the classification accuracy based on PCA is highly sensitive to the type of data and that the variance captured by the principal components is not necessarily a vital indicator for the classification performance.
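
A compact sketch of the kind of comparison described above, with scikit-learn as a stand-in: the same classifier is evaluated on feature subsets chosen by a filter method and on PCA projections of decreasing size; the synthetic data and component counts are illustrative:

    # Compare filter-based feature selection against PCA projections.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                               random_state=0)
    clf = LogisticRegression(max_iter=1000)
    for k in (50, 20, 10, 5):
        sel = make_pipeline(SelectKBest(f_classif, k=k), clf)
        pca = make_pipeline(PCA(n_components=k), clf)
        print(k,
              round(cross_val_score(sel, X, y, cv=5).mean(), 3),
              round(cross_val_score(pca, X, y, cv=5).mean(), 3))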

Journal ArticleDOI
TL;DR: This paper analyzes a simple and computationally inexpensive diagonal cut-off method, establishing a threshold of the order θ_dia = n/[k² log(p−k)] separating success from failure, and proves that a more complex semidefinite programming (SDP) relaxation due to d'Aspremont et al. succeeds once the rescaled sample size is of the order θ_sdp = n/[k log(p−k)].
Abstract: Principal component analysis (PCA) is a classical method for dimensionality reduction based on extracting the dominant eigenvectors of the sample covariance matrix. However, PCA is well known to behave poorly in the ``large $p$, small $n$'' setting, in which the problem dimension $p$ is comparable to or larger than the sample size $n$. This paper studies PCA in this high-dimensional regime, but under the additional assumption that the maximal eigenvector is sparse, say, with at most $k$ nonzero components. We consider a spiked covariance model in which a base matrix is perturbed by adding a $k$-sparse maximal eigenvector, and we analyze two computationally tractable methods for recovering the support set of this maximal eigenvector, as follows: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size $\theta_{\mathrm{dia}}(n,p,k)=n/[k^2\log(p-k)]$; and (b) a more sophisticated semidefinite programming (SDP) relaxation, which succeeds once the rescaled sample size $\theta_{\mathrm{sdp}}(n,p,k)=n/[k\log(p-k)]$ is larger than a critical threshold. In addition, we prove that no method, including the best method which has exponential-time complexity, can succeed in recovering the support if the order parameter $\theta_{\mathrm{sdp}}(n,p,k)$ is below a threshold. Our results thus highlight an interesting trade-off between computational and statistical efficiency in high-dimensional inference.
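
A minimal sketch of the diagonal thresholding (diagonal cut-off) method analyzed in the paper: take the k variables with the largest sample variances as the estimated support of the sparse leading eigenvector. The spiked data below are synthetic and the spike size is chosen to make recovery easy:

    # Support recovery for a k-sparse leading eigenvector by diagonal thresholding.
    import numpy as np

    def diagonal_thresholding_support(X, k):
        variances = X.var(axis=0)
        return set(np.argsort(variances)[-k:])      # k largest sample variances

    rng = np.random.default_rng(13)
    n, p, k = 500, 300, 10
    v = np.zeros(p)
    v[:k] = 1.0 / np.sqrt(k)                        # sparse maximal eigenvector
    beta = 5.0                                      # spike size
    # Samples from N(0, I + beta * v v^T).
    X = rng.normal(size=(n, p)) + np.sqrt(beta) * rng.normal(size=(n, 1)) * v

    support_hat = diagonal_thresholding_support(X, k)
    print(support_hat == set(range(k)))             # exact support recovery?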

Journal ArticleDOI
TL;DR: A new method is proposed to automatically determine the number of components from a limited number of (possibly) high-dimensional noisy samples, based on the eigenvalues of the sample covariance matrix; the method compares favorably with other common algorithms.

Journal ArticleDOI
TL;DR: Kernel PCA is applied to address the limitations associated with the standard K–L expansion and is capable of generating differentiable parameterizations that reproduce the essential features of complex geological structures represented by multipoint geostatistics.
Abstract: This paper describes a novel approach for creating an efficient, general, and differentiable parameterization of large-scale non-Gaussian, non-stationary random fields (represented by multipoint geostatistics) that is capable of reproducing complex geological structures such as channels. Such parameterizations are appropriate for use with gradient-based algorithms applied to, for example, history-matching or uncertainty propagation. It is known that the standard Karhunen–Loeve (K–L) expansion, also called linear principal component analysis or PCA, can be used as a differentiable parameterization of input random fields defining the geological model. The standard K–L model is, however, limited in two respects. It requires an eigen-decomposition of the covariance matrix of the random field, which is prohibitively expensive for large models. In addition, it preserves only the two-point statistics of a random field, which is insufficient for reproducing complex structures. In this work, kernel PCA is applied to address the limitations associated with the standard K–L expansion. Although widely used in machine learning applications, it does not appear to have found any application for geological model parameterization. With kernel PCA, an eigen-decomposition of a small matrix called the kernel matrix is performed instead of the full covariance matrix. The method is much more efficient than the standard K–L procedure. Through use of higher order polynomial kernels, which implicitly define a high-dimensionality feature space, kernel PCA further enables the preservation of high-order statistics of the random field, instead of just two-point statistics as in the K–L method. The kernel PCA eigen-decomposition proceeds using a set of realizations created by geostatistical simulation (honoring two-point or multipoint statistics) rather than the analytical covariance function. We demonstrate that kernel PCA is capable of generating differentiable parameterizations that reproduce the essential features of complex geological structures represented by multipoint geostatistics. The kernel PCA representation is then applied to history match a water flooding problem. This example demonstrates that kernel PCA can be used with gradient-based history matching to provide models that match production history while maintaining multipoint geostatistics consistent with the underlying training image.
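
A hedged sketch of the kernel PCA step with scikit-learn as a stand-in: an ensemble of realizations (random surrogates here) is mapped through a polynomial kernel, low-dimensional coefficients are extracted, and a perturbed coefficient vector is mapped back with the library's approximate pre-image. The ensemble, kernel degree, and component count are illustrative, and the history-matching loop is not shown:

    # Kernel PCA parameterization of an ensemble of flattened model realizations.
    import numpy as np
    from sklearn.decomposition import KernelPCA

    rng = np.random.default_rng(14)
    n_real, nx, ny = 200, 30, 30
    ensemble = rng.random((n_real, nx * ny))        # stand-in for geostatistical realizations

    kpca = KernelPCA(n_components=20, kernel="poly", degree=3,
                     fit_inverse_transform=True)    # enables approximate pre-images
    xi = kpca.fit_transform(ensemble)               # low-dimensional coefficients

    # Perturb the coefficients of one realization and map back to model space;
    # a gradient-based history match would update xi rather than the full field.
    xi_new = xi[:1] + 0.1 * rng.normal(size=(1, 20))
    model_new = kpca.inverse_transform(xi_new).reshape(nx, ny)
    print(xi.shape, model_new.shape)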

Posted Content
Michel Tenenhaus
01 Jan 2008
TL;DR: In this article, the authors explore the use of ULS-SEM (Structural Equation Modelling), PLS (Partial Least Squares), GSCA (Generalized Structured Component Analysis), path analysis on block principal components and path analysis on block scales on customer satisfaction data.
Abstract: In this research, the authors explore the use of ULS-SEM (Structural Equation Modelling), PLS (Partial Least Squares), GSCA (Generalized Structured Component Analysis), path analysis on block principal components and path analysis on block scales on customer satisfaction data.

Journal ArticleDOI
TL;DR: This work proposes a novel semisupervised method for dimensionality reduction called Maximum Margin Projection (MMP), which aims at maximizing the margin between positive and negative examples at each local neighborhood.
Abstract: One of the fundamental problems in Content-Based Image Retrieval (CBIR) has been the gap between low-level visual features and high-level semantic concepts. To narrow down this gap, relevance feedback is introduced into image retrieval. With the user-provided information, a classifier can be learned to distinguish between positive and negative examples. However, in real-world applications, the number of user feedbacks is usually too small compared to the dimensionality of the image space. In order to cope with the high dimensionality, we propose a novel semisupervised method for dimensionality reduction called Maximum Margin Projection (MMP). MMP aims at maximizing the margin between positive and negative examples at each local neighborhood. Different from traditional dimensionality reduction algorithms such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which effectively see only the global Euclidean structure, MMP is designed for discovering the local manifold structure. Therefore, MMP is likely to be more suitable for image retrieval, where nearest neighbor search is usually involved. After projecting the images into a lower dimensional subspace, the relevant images get closer to the query image; thus, the retrieval performance can be enhanced. The experimental results on the Corel image database demonstrate the effectiveness of our proposed algorithm.

Journal ArticleDOI
Michel Tenenhaus
TL;DR: In this paper, the authors explore the use of ULS-SEM, PLS, GSCA, path analysis on block principal components and path analysis in block scales on customer satisfaction data.
Abstract: Two complementary schools have come to the fore in the field of Structural Equation Modelling (SEM): covariance-based SEM and component-based SEM. The first approach has been developed around Karl Joreskog and the second one around Herman Wold under the name "PLS" (Partial Least Squares). Hwang and Takane have proposed a new component-based SEM method named Generalized Structured Component Analysis. Covariance-based SEM is usually used with an objective of model validation and needs a large sample. Component-based SEM is mainly used for score computation and can be carried out on very small samples. In this research, we will explore the use of ULS-SEM, PLS, GSCA, path analysis on block principal components and path analysis on block scales on customer satisfaction data. Our conclusion is that score computation and bootstrap validation are very insensitive to the choice of the method when the blocks are homogenous.

Book
05 Jun 2008
TL;DR: This book discusses collecting, preparing, and checking data for consumer research; sampling, hypothesis testing, and analysis of variance; relationships among variables, including factor analysis and principal component analysis; classification and segmentation techniques; and further methods in multivariate analysis.
Abstract: PART ONE: COLLECTING, PREPARING AND CHECKING THE DATA Measurement, Errors and Data for Consumer Research Secondary Consumer Data Primary Data Collection Data Preparation and Descriptive Statistics PART TWO: SAMPLING, PROBABILITY AND INFERENCE Sampling Hypothesis Testing Analysis of Variance PART THREE: RELATIONSHIPS AMONG VARIABLES Correlation and Regression Association, Log-linear Analysis and Canonical Correlation Analysis Factor Analysis and Principal Component Analysis PART FOUR: CLASSIFICATION AND SEGMENTATION TECHNIQUES Discriminant Analysis Cluster Analysis Multidimensional Scaling Correspondence Analysis PART FIVE: FURTHER METHODS IN MULTIVARIATE ANALYSIS Structural Equation Models Discrete Choice Models The End (and Beyond)

Journal ArticleDOI
TL;DR: AMMI, T-RF-centered PCA, and DCA were the most robust methods in terms of producing ordinations that consistently reached a consensus with other methods, and in datasets with high sample heterogeneity, NMS analyses with Sørensen and Jaccard distance were the most sensitive for recovery of complex gradients.