
Showing papers on "Principal component analysis published in 2006"


Journal ArticleDOI
TL;DR: This work introduces a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings and shows that PCA can be formulated as a regression-type optimization problem.
Abstract: Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
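Conceptually, the regression/elastic-net view of PCA can be tried with off-the-shelf tools. The sketch below uses scikit-learn's SparsePCA, an l1-penalized relative of the paper's SPCA rather than the authors' own algorithm, to contrast dense and sparse loadings on random data.

```python
# A minimal sketch of sparse PCA, assuming scikit-learn's SparsePCA as a
# stand-in for the paper's lasso/elastic-net formulation (not the authors' code).
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X -= X.mean(axis=0)  # center, as PCA assumes

pca = PCA(n_components=3).fit(X)
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)

# Sparse loadings contain many exact zeros, which eases interpretation.
print("dense loadings:\n", np.round(pca.components_, 2))
print("sparse loadings:\n", np.round(spca.components_, 2))
print("zeros per sparse component:", (spca.components_ == 0).sum(axis=1))
```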

3,102 citations


Journal ArticleDOI
TL;DR: This work examined the rotation forest ensemble on a random selection of 33 benchmark data sets from the UCI repository, compared it with bagging, AdaBoost, and random forest, and prompted an investigation into the diversity-accuracy landscape of the ensemble models.
Abstract: We propose a method for generating classifier ensembles based on feature extraction. To create the training data for a base classifier, the feature set is randomly split into K subsets (K is a parameter of the algorithm) and principal component analysis (PCA) is applied to each subset. All principal components are retained in order to preserve the variability information in the data. Thus, K axis rotations take place to form the new features for a base classifier. The idea of the rotation approach is to encourage simultaneously individual accuracy and diversity within the ensemble. Diversity is promoted through the feature extraction for each base classifier. Decision trees were chosen here because they are sensitive to rotation of the feature axes, hence the name "forest". Accuracy is sought by keeping all principal components and also using the whole data set to train each base classifier. Using WEKA, we examined the rotation forest ensemble on a random selection of 33 benchmark data sets from the UCI repository and compared it with bagging, AdaBoost, and random forest. The results were favorable to rotation forest and prompted an investigation into the diversity-accuracy landscape of the ensemble models. Diversity-error diagrams revealed that rotation forest ensembles construct individual classifiers which are more accurate than those in AdaBoost and random forest, and more diverse than those in bagging, sometimes more accurate as well.
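The recipe lends itself to a compact sketch. The class below follows the description above (random feature subsets, PCA per subset with all components kept, one decision tree per rotation) but omits the paper's bootstrap sampling of training objects; the class name and parameters are illustrative, and integer class labels 0..C-1 are assumed.

```python
# A condensed rotation-forest sketch: split features into K random subsets,
# run PCA per subset (keeping all components), and train each tree on the
# rotated data. Illustrative simplification of the published algorithm.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

class MiniRotationForest:
    def __init__(self, n_trees=10, K=3, seed=0):
        self.n_trees, self.K = n_trees, K
        self.rng = np.random.default_rng(seed)
        self.members = []  # (rotation matrix, tree) pairs

    def _rotation(self, X):
        d = X.shape[1]
        order = self.rng.permutation(d)
        R = np.zeros((d, d))
        for subset in np.array_split(order, self.K):
            # All components kept, so each block of R is a full rotation.
            pca = PCA().fit(X[:, subset])
            R[np.ix_(subset, subset)] = pca.components_.T
        return R

    def fit(self, X, y):
        for _ in range(self.n_trees):
            R = self._rotation(X)
            tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
            self.members.append((R, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X @ R) for R, t in self.members])
        # Majority vote across trees; assumes integer labels 0..C-1.
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(MiniRotationForest().fit(X, y).predict(X[:5]))
```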

1,708 citations


Journal ArticleDOI
TL;DR: This paper proposes some new feature extractors based on maximum margin criterion (MMC) and establishes a new linear feature extractor that does not suffer from the small sample size problem, which is known to cause serious stability problems for LDA.
Abstract: In pattern recognition, feature extraction techniques are widely employed to reduce the dimensionality of data and to enhance the discriminatory information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are the two most popular linear dimensionality reduction methods. However, PCA is not very effective for the extraction of the most discriminant features, and LDA is not stable due to the small sample size problem. In this paper, we propose some new (linear and nonlinear) feature extractors based on maximum margin criterion (MMC). Geometrically, feature extractors based on MMC maximize the (average) margin between classes after dimensionality reduction. It is shown that MMC can represent class separability better than PCA. As a connection to LDA, we may also derive LDA from MMC by incorporating some constraints. By using some other constraints, we establish a new linear feature extractor that does not suffer from the small sample size problem, which is known to cause serious stability problems for LDA. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in the paper. Our extensive experiments demonstrate that the new feature extractors are effective, stable, and efficient.
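The linear MMC extractor reduces to a symmetric eigenproblem on S_b - S_w, with no inversion of S_w, which is what sidesteps the small-sample-size instability of LDA. A minimal sketch, assuming the standard between- and within-class scatter matrices:

```python
# A minimal maximum-margin-criterion (MMC) sketch: project onto the top
# eigenvectors of S_b - S_w. Illustrative code, not the authors' implementation.
import numpy as np

def mmc_fit(X, y, n_components):
    mean = X.mean(axis=0)
    Sb = np.zeros((X.shape[1],) * 2)   # between-class scatter
    Sw = np.zeros_like(Sb)             # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    # Symmetric eigenproblem; eigh returns eigenvalues in ascending order.
    vals, vecs = np.linalg.eigh(Sb - Sw)
    return vecs[:, ::-1][:, :n_components]   # top eigenvectors = projection

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 1, (30, 5))])
y = np.repeat([0, 1], 30)
W = mmc_fit(X, y, 1)
print("projected class means:", (X[y == 0] @ W).mean(), (X[y == 1] @ W).mean())
```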

838 citations


Reference BookDOI
23 Jun 2006
TL;DR: This edited volume covers correspondence analysis and related methods in practice, from simple to multiple correspondence analysis; applications include measuring women's participation in the labor force via correspondence analysis of a subset of an indicator matrix.
Abstract: Contents:
CORRESPONDENCE ANALYSIS AND RELATED METHODS IN PRACTICE, Jorg Blasius and Michael Greenacre: A simple example; Basic method; Concepts of correspondence analysis; Stacked tables; Multiple correspondence analysis; Categorical principal components analysis; Active and supplementary variables; Multiway data; Content of the book
FROM SIMPLE TO MULTIPLE CORRESPONDENCE ANALYSIS, Michael Greenacre: Canonical correlation analysis; Geometric approach; Supplementary points; Discussion and conclusions
DIVIDED BY A COMMON LANGUAGE: ANALYZING AND VISUALIZING TWO-WAY ARRAYS, John C. Gower: Introduction: two-way tables and data matrices; Quantitative variables; Categorical variables; Fit and scaling; Discussion and conclusion
NONLINEAR PRINCIPAL COMPONENTS ANALYSIS AND RELATED TECHNIQUES, Jan de Leeuw: Linear PCA; Least-squares nonlinear PCA; Logistic NLPCA; Discussion and conclusions; Software notes
THE GEOMETRIC ANALYSIS OF STRUCTURED INDIVIDUALS × VARIABLES TABLES, Henry Rouanet: PCA and MCA as geometric methods; Structured data analysis; The basketball study; The EPGY study; Concluding comments
CORRELATIONAL STRUCTURE OF MULTIPLE-CHOICE DATA AS VIEWED FROM DUAL SCALING, Shizuhiko Nishisato: Permutations of categories and scaling; Principal components analysis and dual scaling; Statistics for correlational structure of data; Forced classification; Correlation between categorical variables; Properties of squared item-total correlation; Structure of nonlinear correlation; Concluding remarks
VALIDATION TECHNIQUES IN MULTIPLE CORRESPONDENCE ANALYSIS, Ludovic Lebart: External validation; Internal validation (resampling techniques); Example of MCA validation; Conclusion
MULTIPLE CORRESPONDENCE ANALYSIS OF SUBSETS OF RESPONSE CATEGORIES, Michael Greenacre and Rafael Pardo: Correspondence analysis of a subset of an indicator matrix; Application to women's participation in the labor force; Subset MCA applied to the Burt matrix; Discussion and conclusions
SCALING UNIDIMENSIONAL MODELS WITH MULTIPLE CORRESPONDENCE ANALYSIS, Matthijs J. Warrens and Willem J. Heiser: The dichotomous Guttman scale; The Rasch model; The polytomous Guttman scale; The graded response model; Unimodal models; Conclusion
THE UNFOLDING FALLACY UNVEILED: VISUALIZING STRUCTURES OF DICHOTOMOUS UNIDIMENSIONAL ITEM-RESPONSE-THEORY DATA BY MULTIPLE CORRESPONDENCE ANALYSIS, Wijbrandt van Schuur and Jorg Blasius: Item response models for dominance data; Visualizing dominance data; Item response models for proximity data; Visualizing unfolding data; Every two cumulative scales can be represented as a single unfolding scale; Consequences for unfolding analysis; Discussion
REGULARIZED MULTIPLE CORRESPONDENCE ANALYSIS, Yoshio Takane and Heungsun Hwang: The method; Examples; Concluding remarks
THE EVALUATION OF "DON'T KNOW" RESPONSES BY GENERALIZED CANONICAL ANALYSIS, Herbert Matschinger and Matthias C. Angermeyer: Method; Results; Discussion
MULTIPLE FACTOR ANALYSIS FOR CONTINGENCY TABLES, Jerome Pages and Monica Becue-Bertaut: Tabular conventions; Internal correspondence analysis; Balancing the influence of the different tables; Multiple factor analysis for contingency tables (MFACT); MFACT properties; Rules for studying the suitability of MFACT for a data set; Conclusion
SIMULTANEOUS ANALYSIS: A JOINT STUDY OF SEVERAL CONTINGENCY TABLES WITH DIFFERENT MARGINS, Amaya Zarraga and Beatriz Goitisolo: Simultaneous analysis; Interpretation rules for simultaneous analysis; Comments on the appropriateness of the method; Application: study of levels of employment and unemployment according to autonomous community, gender, and training level; Conclusions
MULTIPLE FACTOR ANALYSIS OF MIXED TABLES OF METRIC AND CATEGORICAL DATA, Elena Abascal, Ignacio Garcia Lautre, and M. Isabel Landaluce: Multiple factor analysis; MFA of a mixed table: an alternative to PCA and MCA; Analysis of voting patterns across provinces in Spain's 2004 general election; Conclusions
CORRESPONDENCE ANALYSIS AND CLASSIFICATION, Gilbert Saporta and Ndeye Niang: Linear methods for classification; The "Disqual" methodology; Alternative methods; A case study; Conclusion
MULTIBLOCK CANONICAL CORRELATION ANALYSIS FOR CATEGORICAL VARIABLES: APPLICATION TO EPIDEMIOLOGICAL DATA, Stephanie Bougeard, Mohamed Hanafi, Hicham Nocairi, and El-Mostafa Qannari: Multiblock canonical correlation analysis; Application; Discussion and perspectives
PROJECTION-PURSUIT APPROACH FOR CATEGORICAL DATA, Henri Caussinus and Anne Ruiz-Gazen: Continuous variables; Categorical variables; Conclusion
CORRESPONDENCE ANALYSIS AND CATEGORICAL CONJOINT MEASUREMENT, Anna Torres-Lacomba: Categorical conjoint measurement; Correspondence analysis and canonical correlation analysis; Correspondence analysis and categorical conjoint analysis; Incorporating interactions; Discussion and conclusions
A THREE-STEP APPROACH TO ASSESSING THE BEHAVIOR OF SURVEY ITEMS IN CROSS-NATIONAL RESEARCH, Jorg Blasius and Victor Thiessen: Data; Method; Solutions; Discussion
ADDITIVE AND MULTIPLICATIVE MODELS FOR THREE-WAY CONTINGENCY TABLES: DARROCH (1974) REVISITED, Pieter M. Kroonenberg and Carolyn J. Anderson: Data and design issues; Multiplicative and additive modeling; Multiplicative models; Additive models: three-way correspondence analysis; Categorical principal components analysis; Discussion and conclusions
A NEW MODEL FOR VISUALIZING INTERACTIONS IN ANALYSIS OF VARIANCE, Patrick J.F. Groenen and Alex J. Koning: Holiday-spending data; Decomposing interactions; Interaction decomposition of holiday spending; Conclusions
LOGISTIC BIPLOTS, Jose L. Vicente-Villardon, M. Purificacion Galindo-Villardon, and Antonio Blazquez-Zaballos: Classical biplots; Logistic biplot; Application: microarray gene expression data; Final remarks
References; Appendix; Index

787 citations


Journal ArticleDOI
TL;DR: Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome; it can be applied to regression and generalized regression problems, such as survival analysis.
Abstract: In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
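A minimal sketch of the screening-then-PCA idea, with an illustrative threshold and simulated data (not the authors' implementation or tuning procedure):

```python
# Supervised principal components, sketched: screen predictors by univariate
# association with the outcome, run PCA on the survivors, regress on PC1.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 80, 500                       # many more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = 1.0  # only 10 informative predictors
y = X @ beta + rng.normal(size=n)

# Univariate association scores of standardized predictors with the outcome.
Xs = (X - X.mean(0)) / X.std(0)
scores = np.abs(Xs.T @ (y - y.mean())) / n
keep = scores > np.quantile(scores, 0.95)      # illustrative threshold

pc1 = PCA(n_components=1).fit_transform(Xs[:, keep])
model = LinearRegression().fit(pc1, y)
print("kept predictors:", keep.sum(), " in-sample R^2:", model.score(pc1, y))
```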

773 citations


Proceedings ArticleDOI
25 Jun 2006
TL;DR: Experiments on several real-life datasets show that R1-PCA can effectively handle outliers; L1-norm K-means is shown to lead to poor results, while R1-K-means outperforms standard K-means.
Abstract: Principal component analysis (PCA) minimizes the sum of squared errors (L2-norm) and is sensitive to the presence of outliers. We propose a rotationally invariant L1-norm PCA (R1-PCA). R1-PCA is similar to PCA in that (1) it has a unique global solution, (2) the solution consists of the principal eigenvectors of a robust covariance matrix (re-weighted to soften the effects of outliers), and (3) the solution is rotationally invariant. These properties are not shared by the L1-norm PCA. A new subspace iteration algorithm is given to compute R1-PCA efficiently. Experiments on several real-life datasets show that R1-PCA can effectively handle outliers. We extend the R1-norm to K-means clustering and show that L1-norm K-means leads to poor results, while R1-K-means outperforms standard K-means.
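The described subspace iteration can be sketched compactly: samples with large residuals receive small weights in a re-weighted covariance matrix whose top eigenvectors become the new subspace. The centering, weight formula, and stopping rule below are illustrative simplifications, not the paper's exact algorithm.

```python
# A simplified subspace-iteration sketch in the spirit of R1-PCA.
import numpy as np

def r1_pca(X, k, n_iter=50, eps=1e-8):
    X = X - np.median(X, axis=0)            # robust centering (illustrative)
    U = np.linalg.svd(X, full_matrices=False)[2][:k].T   # warm start from PCA
    for _ in range(n_iter):
        resid = X - (X @ U) @ U.T           # residual of each sample
        w = 1.0 / np.maximum(np.linalg.norm(resid, axis=1), eps)
        C = (X * w[:, None]).T @ X          # re-weighted covariance
        vals, vecs = np.linalg.eigh((C + C.T) / 2)
        U = vecs[:, -k:]                    # top eigenvectors = new subspace
    return U

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([5, 2, 1, 0.5, 0.1])
X[:5] += 50                                 # gross outliers
print(r1_pca(X, 2).shape)                   # (5, 2) robust basis
```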

676 citations


Journal ArticleDOI
TL;DR: The ability of several algorithms to identify the correct muscle synergies and activation coefficients in simulated data, combined with their consistency when applied to physiological data sets, suggests that the muscle synergies found by a particular algorithm are not an artifact of that algorithm, but reflect basic aspects of the organization of muscle activation patterns underlying behaviors.
Abstract: Several recent studies have used matrix factorization algorithms to assess the hypothesis that behaviors might be produced through the combination of a small number of muscle synergies. Although generally agreeing in their basic conclusions, these studies have used a range of different algorithms, making their interpretation and integration difficult. We therefore compared the performance of these different algorithms on both simulated and experimental data sets. We focused on the ability of these algorithms to identify the set of synergies underlying a data set. All data sets consisted of nonnegative values, reflecting the nonnegative data of muscle activation patterns. We found that the performance of principal component analysis (PCA) was generally lower than that of the other algorithms in identifying muscle synergies. Factor analysis (FA) with varimax rotation was better than PCA, and was generally at the same levels as independent component analysis (ICA) and nonnegative matrix factorization (NMF). ICA performed very well on data sets corrupted by constant variance Gaussian noise, but was impaired on data sets with signal-dependent noise and when synergy activation coefficients were correlated. Nonnegative matrix factorization (NMF) performed similarly to ICA and FA on data sets with signal-dependent noise and was generally robust across data sets. The best algorithms were ICA applied to the subspace defined by PCA (ICAPCA) and a version of probabilistic ICA with nonnegativity constraints (pICA). We also evaluated some commonly used criteria to identify the number of synergies underlying a data set, finding that only likelihood ratios based on factor analysis identified the correct number of synergies for data sets with signal-dependent noise in some cases. We then proposed an ad hoc procedure, finding that it was able to identify the correct number in a larger number of cases. Finally, we applied these methods to an experimentally obtained data set. The best performing algorithms (FA, ICA, NMF, ICAPCA, pICA) identified synergies very similar to one another. Based on these results, we discuss guidelines for using factorization algorithms to analyze muscle activation patterns. More generally, the ability of several algorithms to identify the correct muscle synergies and activation coefficients in simulated data, combined with their consistency when applied to physiological data sets, suggests that the muscle synergies found by a particular algorithm are not an artifact of that algorithm, but reflect basic aspects of the organization of muscle activation patterns underlying behaviors.
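The comparison can be reproduced in spirit with scikit-learn's implementations of four of the factorization algorithms; the synthetic "synergy" data and the reconstruction-error metric below are illustrative, not the paper's simulation protocol.

```python
# Factor the same nonnegative activation matrix with PCA, FA, ICA, and NMF
# and compare reconstructions (illustrative benchmark on toy data).
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis, FastICA, NMF

rng = np.random.default_rng(0)
W_true = rng.uniform(size=(100, 4))      # activation coefficients
H_true = rng.uniform(size=(4, 16))       # 4 synergies over 16 muscles
V = W_true @ H_true + 0.05 * rng.uniform(size=(100, 16))  # nonnegative data

for name, model in [("PCA", PCA(4)), ("FA", FactorAnalysis(4)),
                    ("ICA", FastICA(4, random_state=0)),
                    ("NMF", NMF(4, init="nndsvda", max_iter=500))]:
    W = model.fit_transform(V)
    if hasattr(model, "inverse_transform"):
        V_hat = model.inverse_transform(W)
    else:  # FactorAnalysis: rebuild from loadings and mean
        V_hat = W @ model.components_ + model.mean_
    err = np.linalg.norm(V - V_hat) / np.linalg.norm(V)
    print(f"{name}: relative reconstruction error = {err:.3f}")
```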

672 citations


Journal ArticleDOI
TL;DR: This paper presents an independent component analysis (ICA) approach to DR, called ICA-DR, which uses mutual information as a criterion to measure data statistical independency beyond second-order statistics.
Abstract: In hyperspectral image analysis, principal components analysis (PCA) and the maximum noise fraction (MNF) are the most commonly used techniques for dimensionality reduction (DR), referred to as PCA-DR and MNF-DR, respectively. The criteria used by PCA-DR and MNF-DR are data variance and signal-to-noise ratio (SNR), which are designed to measure data second-order statistics. This paper presents an independent component analysis (ICA) approach to DR, called ICA-DR, which uses mutual information as a criterion to measure data statistical independency beyond second-order statistics. As a result, the ICA-DR can capture information that cannot be retained or preserved by second-order statistics-based DR techniques. In order for the ICA-DR to perform effectively, the virtual dimensionality (VD) is introduced to estimate the number of dimensions that need to be retained, as opposed to the energy percentage used by PCA-DR and MNF-DR to determine energies contributed by signal sources and noise. Since there is no prioritization among components generated by the ICA-DR, due to the use of random initial projection vectors, we further develop criteria and algorithms to measure the significance of the information contained in each ICA-generated component for component prioritization. Finally, a comparative study and analysis is conducted among the three DR techniques, PCA-DR, MNF-DR, and ICA-DR, in two applications, endmember extraction and data compression, where the proposed ICA-DR is shown to provide advantages over PCA-DR and MNF-DR.
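A sketch of the ICA-DR pipeline on toy data: unmix with FastICA, then impose an ordering after the fact, since random initial projections leave ICA components unprioritized. Ranking by absolute kurtosis below is an illustrative stand-in for the paper's prioritization criteria, and the data are simulated.

```python
# ICA as dimensionality reduction, with post-hoc component prioritization.
import numpy as np
from scipy.stats import kurtosis
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
S_true = rng.exponential(size=(2500, 4))        # non-Gaussian latent sources
A_mix = rng.normal(size=(4, 20))                # 20 spectral bands
pixels = S_true @ A_mix + 0.05 * rng.normal(size=(2500, 20))
cube = pixels.reshape(50, 50, 20)               # toy hyperspectral cube

n_keep = 4                                      # e.g. virtual dimensionality (VD)
ica = FastICA(n_components=n_keep, random_state=0)
S = ica.fit_transform(cube.reshape(-1, 20))     # independent components

# No natural ICA ordering, so rank components by a high-order statistic
# and keep the most non-Gaussian ones first (illustrative criterion).
order = np.argsort(-np.abs(kurtosis(S, axis=0)))
print("component ranking by |kurtosis|:", order)
```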

594 citations


Book
17 Feb 2006
TL;DR: A handbook of statistical analyses using R, covering methods from basic inference through principal component analysis, multidimensional scaling, and cluster analysis.
Abstract: Contents:
An Introduction to R: What Is R?; Installing R; Help and Documentation; Data Objects in R; Data Import and Export; Basic Data Manipulation; Computing with Data; Organizing an Analysis
Data Analysis Using Graphical Displays: Introduction; Initial Data Analysis; Analysis Using R
Simple Inference: Introduction; Statistical Tests; Analysis Using R
Conditional Inference: Introduction; Conditional Test Procedures; Analysis Using R
Analysis of Variance: Introduction; Analysis of Variance; Analysis Using R
Simple and Multiple Linear Regression: Introduction; Simple Linear Regression; Multiple Linear Regression; Analysis Using R
Logistic Regression and Generalized Linear Models: Introduction; Logistic Regression and Generalized Linear Models; Analysis Using R
Density Estimation: Introduction; Density Estimation; Analysis Using R
Recursive Partitioning: Introduction; Recursive Partitioning; Analysis Using R
Scatterplot Smoothers and Generalized Additive Models: Introduction; Scatterplot Smoothers and Generalized Additive Models; Analysis Using R
Survival Analysis: Introduction; Survival Analysis; Analysis Using R
Analyzing Longitudinal Data I: Introduction; Analyzing Longitudinal Data; Linear Mixed Effects Models; Analysis Using R; Prediction of Random Effects; The Problem of Dropouts
Analyzing Longitudinal Data II: Introduction; Methods for Nonnormal Distributions; Analysis Using R: GEE; Analysis Using R: Random Effects
Simultaneous Inference and Multiple Comparisons: Introduction; Simultaneous Inference and Multiple Comparisons; Analysis Using R
Meta-Analysis: Introduction; Systematic Reviews and Meta-Analysis; Statistics of Meta-Analysis; Analysis Using R; Meta-Regression; Publication Bias
Principal Component Analysis: Introduction; Principal Component Analysis; Analysis Using R
Multidimensional Scaling: Introduction; Multidimensional Scaling; Analysis Using R
Cluster Analysis: Introduction; Cluster Analysis; Analysis Using R
Bibliography; Index. A Summary appears at the end of each chapter.

591 citations


Book
01 Jan 2006
TL;DR: Elementary concepts in statistics -- Basic statistics and tables -- ANOVA/MANOVA -- Association rules -- Boosting trees -- Canonical analysis -- CHAID analysis -- Classification and regression trees -- Classification trees -- Cluster analysis -- Correspondence analysis -- Data mining techniques -- Discriminant function analysis -- Distribution fitting -- Experimental design.
Abstract: Elementary concepts in statistics -- Basic statistics and tables -- ANOVA/MANOVA -- Association rules -- Boosting trees -- Canonical analysis -- CHAID analysis -- Classification and regression trees (CART) -- Classification trees -- Cluster analysis -- Correspondence analysis -- Data mining techniques -- Discriminant function analysis -- Distribution fitting -- Experimental design (Industrial DOE) -- Factor analysis and principal components -- General discrimination analysis (GDA) -- General linear models (GLM) -- General regression models (GRM) -- Generalized additive models (GAM) -- Generalized linear/nonlinear models (GLZ) -- Log linear analysis of frequency tables -- Machine learning -- Multivariate adaptive regression splines (MARSplines) -- Multidimensional scaling (MDS) -- Multiple linear regression -- Neural networks -- Nonlinear estimation -- Nonparametric statistics -- Partial least squares (PLS) -- Power analysis -- Process analysis -- Quality control charts -- Reliabilty/item analysis -- Structural equation modeling -- Survival/failure time analysis -- Text mining -- Time series/forecasting -- Variance components and mixed model ANOVA/ANCOVA.

586 citations


Journal ArticleDOI
TL;DR: This paper proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data and presents extensive theoretical analysis and experimental results.
Abstract: This paper explores the possibility of using multiplicative random projection matrices for privacy preserving distributed data mining. It specifically considers the problem of computing statistical aggregates like the inner product matrix, correlation coefficient matrix, and Euclidean distance matrix from distributed privacy sensitive data possibly owned by multiple parties. This class of problems is directly related to many other data-mining problems such as clustering, principal component analysis, and classification. This paper makes primary contributions on two different grounds. First, it explores independent component analysis as a possible tool for breaching privacy in deterministic multiplicative perturbation-based models such as random orthogonal transformation and random rotation. Then, it proposes an approximate random projection-based technique to improve the level of privacy protection while still preserving certain statistical characteristics of the data. The paper presents extensive theoretical analysis and experimental results. Experiments demonstrate that the proposed technique is effective and can be successfully used for different types of privacy-preserving data mining applications.
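The core guarantee behind multiplicative random projection is easy to demonstrate: with the projection matrix's entries drawn i.i.d. N(0, 1/k), projected inner products are unbiased estimates of the originals, so aggregates can be computed without sharing raw records. A minimal sketch with illustrative dimensions:

```python
# Random-projection sketching preserves inner products in expectation.
import numpy as np

rng = np.random.default_rng(0)
p, k = 100, 40                     # original and projected dimensions
x, y = rng.normal(size=p), rng.normal(size=p)

# Entries i.i.d. N(0, 1/k) so that E[R^T R] = I_p.
R = rng.normal(scale=1 / np.sqrt(k), size=(k, p))
u, v = R @ x, R @ y

print("true inner product:     ", x @ y)
print("estimated from sketches:", u @ v)     # unbiased estimate of x.y
```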

Posted Content
TL;DR: In this article, the authors consider Bayesian regression with normal and double-exponential priors as forecasting methods based on large panels of time series and show that these forecasts are highly correlated with principal component forecasts and that they perform equally well for a wide range of prior choices.
Abstract: This paper considers Bayesian regression with normal and double-exponential priors as forecasting methods based on large panels of time series. We show that, empirically, these forecasts are highly correlated with principal component forecasts and that they perform equally well for a wide range of prior choices. Moreover, we study the asymptotic properties of the Bayesian regression under a Gaussian prior, under the assumption that the data are quasi-collinear, to establish a criterion for setting parameters in a large cross-section.
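A sketch of the reported comparison on simulated factor-structured data: ridge regression (the posterior mode under a Gaussian prior) versus principal component regression, summarized by the correlation of their out-of-sample forecasts. All settings are illustrative.

```python
# Ridge (Gaussian-prior Bayesian regression) vs. principal component forecasts.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
T, N = 200, 100                               # time periods x series
F = rng.normal(size=(T, 3))                   # 3 common factors
X = F @ rng.normal(size=(3, N)) + rng.normal(size=(T, N))
y = F[:, 0] + 0.1 * rng.normal(size=T)        # target loads on factor 1

ridge_fc = Ridge(alpha=100.0).fit(X[:-50], y[:-50]).predict(X[-50:])
pcs = PCA(n_components=3).fit(X[:-50])
pcr = LinearRegression().fit(pcs.transform(X[:-50]), y[:-50])
pcr_fc = pcr.predict(pcs.transform(X[-50:]))

print("forecast correlation:", np.corrcoef(ridge_fc, pcr_fc)[0, 1])
```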

Proceedings Article
01 Jan 2006
TL;DR: A practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space and achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over the previous baseline.
Abstract: This paper extends the within-class covariance normalization (WCCN) technique described in [1, 2] for training generalized linear kernels. We describe a practical procedure for applying WCCN to an SVM-based speaker recognition system where the input feature vectors reside in a high-dimensional space. Our approach involves using principal component analysis (PCA) to split the original feature space into two subspaces: a low-dimensional "PCA space" and a high-dimensional "PCA-complement space." After performing WCCN in the PCA space, we concatenate the resulting feature vectors with a weighted version of their PCA-complements. When applied to a state-of-the-art MLLR-SVM speaker recognition system, this approach achieves improvements of up to 22% in EER and 28% in minimum decision cost function (DCF) over our previous baseline. We also achieve substantial improvements over an MLLR-SVM system that performs WCCN in the PCA space but discards the PCA-complement.
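A sketch of the split-and-normalize construction described above: WCCN (whitening by the inverse square root of the within-class covariance) is applied in a PCA subspace, and the PCA-complement is appended with a weight. Data, class counts, and constants are placeholders, not the paper's system.

```python
# WCCN in a PCA subspace, concatenated with a weighted PCA-complement.
import numpy as np
from scipy.linalg import inv, sqrtm
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))                  # high-dimensional features
y = rng.integers(0, 10, size=300)               # 10 speakers (illustrative)

pca = PCA(n_components=20).fit(X)
Z = pca.transform(X)                            # low-dimensional "PCA space"
Z_comp = X - pca.inverse_transform(Z)           # "PCA-complement space"

W = sum(np.cov(Z[y == c].T) for c in range(10)) / 10   # within-class covariance
A = np.real(sqrtm(inv(W)))                      # WCCN map W^{-1/2}

alpha = 0.5                                     # illustrative complement weight
features = np.hstack([Z @ A.T, alpha * Z_comp])
print(features.shape)
```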

Journal ArticleDOI
TL;DR: The properties of functional principal component analysis are elucidated through stochastic expansions and related results, which can be used to explore properties of existing methods and to suggest new techniques.
Abstract: Summary. Functional data analysis is intrinsically infinite dimensional; functional principal component analysis reduces dimension to a finite level, and points to the most significant components of the data. However, although this technique is often discussed, its properties are not as well understood as they might be. We show how the properties of functional principal component analysis can be elucidated through stochastic expansions and related results. Our approach quantifies the errors that arise through statistical approximation, in successive terms of orders n−1/2, n−1, n−3/2, …, where n denotes sample size. The expansions show how spacings among eigenvalues impact on statistical performance. The term of size n−1/2 illustrates first-order properties and leads directly to limit theory which describes the dominant effect of spacings. Thus, for example, spacings are seen to have an immediate, first-order effect on properties of eigenfunction estimators, but only a second-order effect on eigenvalue estimators. Our results can be used to explore properties of existing methods, and also to suggest new techniques. In particular, we suggest bootstrap methods for constructing simultaneous confidence regions for an infinite number of eigenvalues, and also for individual eigenvalues and eigenvectors.
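The bootstrap suggestion can be sketched directly on discretized curves: resample curves with replacement, re-estimate the leading eigenvalue of the empirical covariance operator, and take percentile limits. The grid, sample size, and two-term Karhunen-Loeve model below are illustrative.

```python
# Bootstrap confidence interval for the leading functional-PCA eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
n = 100
# Curves from a two-term Karhunen-Loeve expansion.
curves = (rng.normal(scale=2.0, size=(n, 1)) * np.sin(2 * np.pi * t)
          + rng.normal(scale=1.0, size=(n, 1)) * np.cos(2 * np.pi * t))

def top_eigenvalue(Y):
    Yc = Y - Y.mean(axis=0)
    cov = Yc.T @ Yc / len(Y)                     # pointwise covariance matrix
    return np.linalg.eigvalsh(cov)[-1] / len(t)  # grid scaling ~ operator eigenvalue

boot = [top_eigenvalue(curves[rng.integers(0, n, n)]) for _ in range(500)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate {top_eigenvalue(curves):.2f}, bootstrap 95% CI ({lo:.2f}, {hi:.2f})")
```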

Journal ArticleDOI
TL;DR: A novel unsupervised criterion based on SVD-entropy is proposed, selecting a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis; feature filtering according to CE is shown to outperform the variance method and gene shaving.
Abstract: Motivation: Many methods have been developed for selecting small informative feature subsets in large noisy data. However, unsupervised methods are scarce. Examples are using the variance of data collected for each feature, or the projection of the feature on the first principal component. We propose a novel unsupervised criterion, based on SVD-entropy, selecting a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis. This can be implemented in four ways: simple ranking according to CE values (SR); forward selection by accumulating features according to which set produces highest entropy (FS1); forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2); backward elimination (BE) of features with the lowest CE. Results: We apply our methods to different benchmarks. In each case we evaluate the success of clustering the data in the selected feature spaces, by measuring Jaccard scores with respect to known classifications. We demonstrate that feature filtering according to CE outperforms the variance method and gene-shaving. There are cases where the analysis, based on a small set of selected features, outperforms the best score reported when all information was used. Our method calls for an optimal size of the relevant feature set. This turns out to be just a few percent of the number of genes in the two Leukemia datasets that we have analyzed. Moreover, the most favored selected genes turn out to have significant GO enrichment in relevant cellular processes. Abbreviations: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Quantum Clustering (QC), Gene Shaving (GS), Variance Selection (VS), Backward Elimination (BE) Contact: royke@cs.huji.ac.il Conflicts of Interest: not reported
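A sketch of the simple-ranking (SR) variant: score each feature by the drop in SVD-entropy when it is left out, then rank. The entropy normalization below follows the usual SVD-entropy definition and is illustrative rather than the authors' exact code.

```python
# SVD-entropy feature ranking (SR): CE_j = entropy(all) - entropy(without j).
import numpy as np

def svd_entropy(X):
    sv2 = np.linalg.svd(X, compute_uv=False) ** 2
    rho = sv2 / sv2.sum()                        # normalized spectrum
    rho = rho[rho > 0]
    return -(rho * np.log(rho)).sum() / np.log(len(sv2))

def ce_scores(X):
    base = svd_entropy(X)
    return np.array([base - svd_entropy(np.delete(X, j, axis=1))
                     for j in range(X.shape[1])])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
X[:, 0] = np.linspace(0, 1, 40)                  # one structured feature
ranking = np.argsort(-ce_scores(X))              # simple ranking (SR)
print("features ranked by contribution to entropy:", ranking)
```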

Journal ArticleDOI
TL;DR: The PARAFAC decompositions were able to extract the expected features of a previously reported ERP paradigm: namely, a quantitative difference of coherent occipital gamma activity between conditions of a visual paradigm and a qualitative difference which has not previously been reported.

Journal ArticleDOI
TL;DR: Source apportionment of fine particulate matter in Beijing, China, was determined using two eigenvector models, principal component analysis/absolute principal component scores (PCA/APCS) and UNMIX, and the results were comparable to previous estimates obtained using the positive matrix factorization (PMF) and chemical mass balance (CMB) receptor models.

Journal ArticleDOI
TL;DR: A new approach to dimensionality reduction, which utilizes a variable number of principal component (PC) axes, produced higher cross-validation assignment rates than either the standard approach of using a fixed number of PC axes or a partial least squares method.
Abstract: Geometric morphometric methods of capturing information about curves or outlines of organismal structures may be used in conjunction with canonical variates analysis (CVA) to assign specimens to groups or populations based on their shapes. This methodological paper examines approaches to optimizing the classification of specimens based on their outlines. This study examines the performance of four approaches to the mathematical representation of outlines and two different approaches to curve measurement as applied to a collection of feather outlines. A new approach to the dimension reduction necessary to carry out a CVA on this type of outline data with modest sample sizes is also presented, and its performance is compared to two other approaches to dimension reduction. Two semi-landmark-based methods, bending energy alignment and perpendicular projection, are shown to produce roughly equal rates of classification, as do elliptical Fourier methods and the extended eigenshape method of outline measurement. Rates of classification were not highly dependent on the number of points used to represent a curve or the manner in which those points were acquired. The new approach to dimensionality reduction, which utilizes a variable number of principal component (PC) axes, produced higher cross-validation assignment rates than either the standard approach of using a fixed number of PC axes or a partial least squares method. Classification of specimens based on feather shape was not highly dependent on the details of the method used to capture shape information. The choice of dimensionality reduction approach was more of a factor, and the cross-validation rate of assignment may be optimized using the variable number of PC axes method presented herein.
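A sketch of the variable-number-of-PC-axes idea using scikit-learn: sweep the number of retained components and keep the value that maximizes cross-validated discriminant-analysis accuracy. The simulated "outline" data are illustrative.

```python
# Choose the number of PC axes by cross-validated classification rate.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 40))                 # e.g. flattened outline points
X[:30, :5] += 1.0; X[60:, :5] -= 1.0          # three groups, weak signal
y = np.repeat([0, 1, 2], 30)

rates = {k: cross_val_score(
             make_pipeline(PCA(n_components=k), LinearDiscriminantAnalysis()),
             X, y, cv=5).mean()
         for k in range(2, 21)}
best_k = max(rates, key=rates.get)
print(f"best number of PC axes: {best_k} (rate {rates[best_k]:.2f})")
```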

Posted Content
TL;DR: In this article, the authors proposed to use a fundamental result in random matrix theory, the Marcenko-Pastur equation, to better estimate the eigenvalues of large dimensional covariance matrices.
Abstract: Estimating the eigenvalues of a population covariance matrix from a sample covariance matrix is a problem of fundamental importance in multivariate statistics; the eigenvalues of covariance matrices play a key role in many widely used techniques, in particular in Principal Component Analysis (PCA). In many modern data analysis problems, statisticians are faced with large datasets where the sample size, n, is of the same order of magnitude as the number of variables p. Random matrix theory predicts that in this context, the eigenvalues of the sample covariance matrix are not good estimators of the eigenvalues of the population covariance. We propose to use a fundamental result in random matrix theory, the Marcenko-Pastur equation, to better estimate the eigenvalues of large dimensional covariance matrices. The Marcenko-Pastur equation holds in very wide generality and under weak assumptions. The estimator we obtain can be thought of as "shrinking" the eigenvalues of the sample covariance matrix in a nonlinear fashion to estimate the population eigenvalues. Inspired by ideas of random matrix theory, we also suggest a change of point of view when thinking about estimation of high-dimensional vectors: we do not try to estimate the vectors directly but rather a probability measure that describes them. We think this is a theoretically more fruitful way to think statistically about these problems. Our estimator gives fast and good or very good results in extended simulations. Our algorithmic approach is based on convex optimization. We also show that the proposed estimator is consistent.
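The phenomenon the estimator corrects is easy to exhibit: for an identity population covariance with p comparable to n, sample eigenvalues spread over the Marcenko-Pastur support instead of concentrating at 1. The clipping step at the end is a crude illustration, not the paper's convex-optimization estimator.

```python
# Sample eigenvalue dispersion under the Marcenko-Pastur law.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 100                     # sample size and dimension, same order
X = rng.normal(size=(n, p))         # population covariance = I_p
sample_eigs = np.linalg.eigvalsh(X.T @ X / n)

gamma = p / n
edges = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print("population eigenvalues: all 1.0")
print(f"sample eigenvalue range: [{sample_eigs[0]:.2f}, {sample_eigs[-1]:.2f}]")
print(f"Marcenko-Pastur support: [{edges[0]:.2f}, {edges[1]:.2f}]")

# Crude shrinkage: eigenvalues inside the MP bulk are consistent with pure
# noise around 1, so snap them to 1 (illustrative, not the paper's method).
cleaned = np.where((sample_eigs > edges[0]) & (sample_eigs < edges[1]),
                   1.0, sample_eigs)
```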

Journal ArticleDOI
01 Aug 2006
TL;DR: Experimental results show that the difference of the average recognition accuracy between the proposed incremental method and the batch-mode method is less than 1%.
Abstract: Principal component analysis (PCA) has been proven to be an efficient method in pattern recognition and image analysis. Recently, PCA has been extensively employed for face-recognition algorithms, such as eigenface and fisherface. Encouraging results have been reported and discussed in the literature. Many PCA-based face-recognition systems have also been developed in the last decade. However, existing PCA-based face-recognition systems are hard to scale up because of the computational cost and memory-requirement burden. To overcome this limitation, an incremental approach is usually adopted. Incremental PCA (IPCA) methods have been studied for many years in the machine-learning community. The major limitation of existing IPCA methods is that there is no guarantee on the approximation error. In view of this limitation, this paper proposes a new IPCA method based on the idea of a singular value decomposition (SVD) updating algorithm, namely an SVD updating-based IPCA (SVDU-IPCA) algorithm. In the proposed SVDU-IPCA algorithm, we have mathematically proved that the approximation error is bounded. A complexity analysis of the proposed method is also presented. Another characteristic of the proposed SVDU-IPCA algorithm is that it can be easily extended to a kernel version. The proposed method has been evaluated using available public databases, namely FERET, AR, and Yale B, and applied to existing face-recognition algorithms. Experimental results show that the difference of the average recognition accuracy between the proposed incremental method and the batch-mode method is less than 1%. This implies that the proposed SVDU-IPCA method gives a close approximation to the batch-mode PCA method.
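The flavor of SVD updating can be shown with a Brand-style row update: fold a new block of rows into an existing rank-k SVD via one small SVD instead of refactoring everything. This is a generic sketch of the idea; the paper's SVDU-IPCA and its error bound differ in detail.

```python
# Incremental SVD: update a rank-k factorization with a new block of rows.
import numpy as np

def svd_update(U, s, Vt, B, k):
    """Update economy SVD X ~ U diag(s) Vt with new rows B, truncated to rank k."""
    V = Vt.T
    C = B @ V                                   # coefficients in current basis
    resid = B - C @ V.T                         # part of B outside the basis
    Q, R = np.linalg.qr(resid.T)                # orthonormal extension of V
    K = np.block([[np.diag(s), np.zeros((len(s), R.shape[0]))],
                  [C,          R.T]])
    Uk, sk, Vkt = np.linalg.svd(K, full_matrices=False)
    U_new = np.block([[U, np.zeros((U.shape[0], B.shape[0]))],
                      [np.zeros((B.shape[0], U.shape[1])), np.eye(B.shape[0])]]) @ Uk
    V_new = np.hstack([V, Q]) @ Vkt.T
    return U_new[:, :k], sk[:k], V_new[:, :k].T

rng = np.random.default_rng(0)
L = rng.normal(size=(5, 20))                      # shared loadings, rank 5
X = rng.normal(size=(100, 5)) @ L + 0.01 * rng.normal(size=(100, 20))
B = rng.normal(size=(10, 5)) @ L + 0.01 * rng.normal(size=(10, 20))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U2, s2, Vt2 = svd_update(U[:, :5], s[:5], Vt[:5], B, k=5)
print(np.round(s2, 2))                            # vs. batch SVD of the stack:
print(np.round(np.linalg.svd(np.vstack([X, B]), compute_uv=False)[:5], 2))
```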

Journal ArticleDOI
TL;DR: Experimental results demonstrate that the ICA-AQA, a high-order statistics-based technique, performs at least comparably to abundance-constrained methods.
Abstract: Independent component analysis (ICA) has shown success in many applications. This paper investigates a new application of ICA in endmember extraction and abundance quantification for hyperspectral imagery. An endmember is generally referred to as an idealized pure signature for a class whose presence is considered to be rare; when it occurs, it may not appear in a large population. In this case, the commonly used principal components analysis may not be effective, since endmembers usually contribute very little to data variance. In order to substantiate the author's findings, an ICA-based approach, called the ICA-based abundance quantification algorithm (ICA-AQA), is developed. Three novelties result from the proposed ICA-AQA. First, unlike the commonly used least squares abundance-constrained linear spectral mixture analysis (ACLSMA), which is a second-order statistics-based method, the ICA-AQA is a high-order statistics-based technique. Second, due to the use of statistical independency, it is generally thought that ICA cannot be implemented as a constrained method; the ICA-AQA shows otherwise. Third, in order for the ACLSMA to perform abundance quantification, it requires an algorithm to find image endmembers first, followed by an abundance-constrained algorithm for quantification. As opposed to such a two-stage process, the ICA-AQA can accomplish endmember extraction and abundance quantification simultaneously in a one-shot operation. Experimental results demonstrate that the ICA-AQA performs at least comparably to abundance-constrained methods.
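A very rough sketch of one-shot ICA-based unmixing: FastICA produces candidate abundance maps directly, and a crude clip-and-renormalize step stands in for the paper's abundance constraints. This is illustrative only, not the ICA-AQA algorithm.

```python
# ICA-based unmixing sketch with crude abundance constraints (illustrative).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
endmembers = rng.uniform(size=(3, 30))            # 3 pure spectra, 30 bands
abund = rng.dirichlet(np.ones(3), size=1000)      # true abundances per pixel
pixels = abund @ endmembers + 0.01 * rng.normal(size=(1000, 30))

ica = FastICA(n_components=3, random_state=0)
S = ica.fit_transform(pixels)                     # candidate abundance maps

A = np.clip(S - S.min(axis=0), 0, None)           # enforce nonnegativity
A /= A.sum(axis=1, keepdims=True)                 # enforce sum-to-one
print("estimated abundances shape:", A.shape)     # (n_pixels, n_endmembers)
```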

Proceedings Article
04 Dec 2006
TL;DR: A PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection is developed, based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
Abstract: We consider the problem of network anomaly detection in large distributed systems. In this setting, Principal Component Analysis (PCA) has been proposed as a method for discovering anomalies by continuously tracking the projection of the data onto a residual subspace. This method was shown to work well empirically in highly aggregated networks, that is, those with a limited number of large nodes and at coarse time scales. This approach, however, has scalability limitations. To overcome these limitations, we develop a PCA-based anomaly detector in which adaptive local data filters send to a coordinator just enough data to enable accurate global detection. Our method is based on a stochastic matrix perturbation analysis that characterizes the tradeoff between the accuracy of anomaly detection and the amount of data communicated over the network.
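The centralized detector that the distributed scheme approximates is compact: fit PCA on normal traffic and flag time steps whose squared prediction error (the energy in the residual subspace) exceeds a control limit. The threshold and data below are illustrative.

```python
# Residual-subspace (SPE) anomaly detection with PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = rng.normal(size=(500, 20))              # link-traffic matrix (toy)
pca = PCA(n_components=4).fit(normal)

def spe(X):                                      # squared prediction error
    return ((X - pca.inverse_transform(pca.transform(X))) ** 2).sum(axis=1)

threshold = np.percentile(spe(normal), 99.5)     # illustrative control limit
test = rng.normal(size=(100, 20))
test[10] += 8                                    # injected volume anomaly
print("flagged time steps:", np.where(spe(test) > threshold)[0])
```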

Journal ArticleDOI
TL;DR: This paper provides a comparison study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods.
Abstract: An important application of gene expression microarray data is classification of biological samples or prediction of clinical and other outcomes. One necessary part of multivariate statistical analysis in such applications is dimension reduction. This paper provides a comparison study of three dimension reduction techniques, namely partial least squares (PLS), sliced inverse regression (SIR) and principal component analysis (PCA), and evaluates the relative performance of classification procedures incorporating those methods. A five-step assessment procedure is designed for the purpose. Predictive accuracy and computational efficiency of the methods are examined. Two gene expression data sets for tumor classification are used in the study.
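A sketch of such a comparison for two of the three methods (PCA and PLS; SIR is omitted for brevity), scoring each reducer-plus-classifier pipeline by stratified cross-validation on simulated expression-like data:

```python
# Compare dimension-reduction methods by downstream classification accuracy.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))                  # expression-like data
y = np.repeat([0, 1], 50)
X[y == 1, :20] += 0.7                             # 20 informative genes

def cv_accuracy(reducer_factory):
    accs = []
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y):
        red = reducer_factory().fit(X[tr], y[tr])  # PCA ignores y; PLS uses it
        clf = LinearDiscriminantAnalysis().fit(red.transform(X[tr]), y[tr])
        accs.append(clf.score(red.transform(X[te]), y[te]))
    return np.mean(accs)

print("PCA + LDA:", cv_accuracy(lambda: PCA(n_components=3)))
print("PLS + LDA:", cv_accuracy(lambda: PLSRegression(n_components=3)))
```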

Journal ArticleDOI
TL;DR: In this paper, a robust projection-pursuit method for principal component analysis (PCA) is proposed for the analysis of chemical data, where the number of variables is typically large.
Abstract: Principal Component Analysis (PCA) is very sensitive to the presence of outliers. One of the most appealing robust methods for principal component analysis uses the Projection-Pursuit principle: one projects the data onto a lower-dimensional space such that a robust measure of variance of the projected data is maximized. The Projection-Pursuit based method for principal component analysis has recently been introduced in the field of chemometrics, where the number of variables is typically large. In this paper, it is shown that the currently available algorithm for robust Projection-Pursuit PCA performs poorly in the presence of many variables. A new algorithm is proposed that is more suitable for the analysis of chemical data. Its performance is studied by means of simulation experiments and illustrated on some real datasets.
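A sketch of projection-pursuit PCA using the classic device of taking candidate directions through the (centered) data points and scoring each by a robust spread measure, here the MAD; a deflation loop then repeats for further components. Illustrative only, not the paper's improved algorithm.

```python
# Projection-pursuit PCA sketch: maximize a robust spread (MAD) over
# candidate directions through the data points, then deflate.
import numpy as np

def pp_pca(X, k):
    X = X - np.median(X, axis=0)
    components = []
    for _ in range(k):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        cand = X / np.maximum(norms, 1e-12)          # candidate directions
        proj = X @ cand.T
        mads = np.median(np.abs(proj - np.median(proj, axis=0)), axis=0)
        a = cand[np.argmax(mads)]                    # max robust spread
        components.append(a)
        X = X - np.outer(X @ a, a)                   # deflate and repeat
    return np.array(components)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) * np.linspace(3, 0.5, 10)
X[:10] *= 20                                         # contamination
print(pp_pca(X, 2))                                  # robust directions
```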

Journal ArticleDOI
TL;DR: In this paper, the authors present a complete multivariate statistical process control (MSPC) application method that combines recent contributions to the field, including multiway principal component analysis (PCA), recursive PCA, fault detection using a combined index, and fault contributions from Hotelling's T² statistic.
Abstract: The purposes of multivariate statistical process control (MSPC) are to improve process operations by quickly detecting when process abnormalities have occurred and diagnosing the sources of the process abnormalities. In the area of semiconductor manufacturing, increased yield and improved product quality result from reducing the amount of wafers produced under suboptimal operating conditions. This paper presents a complete MSPC application method that combines recent contributions to the field, including multiway principal component analysis (PCA), recursive PCA, fault detection using a combined index, and fault contributions from Hotelling's T² statistic. In addition, a method for determining multiblock fault contributions to the combined index is introduced. The effectiveness of the system is demonstrated using postlithography metrology data and plasma stripper processing tool data.
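A sketch of the monitoring core: fit PCA on in-control data, track Hotelling's T² for variation inside the model and Q (SPE) for variation outside it, and combine the two indices. The empirical-percentile control limits below are illustrative, not the paper's limit formulas.

```python
# PCA-based MSPC: Hotelling's T^2, Q/SPE, and a combined index.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 12))             # normal-operation wafers (toy)
pca = PCA(n_components=3).fit(train)

def t2_and_q(X):
    scores = pca.transform(X)
    t2 = ((scores / np.sqrt(pca.explained_variance_)) ** 2).sum(axis=1)
    q = ((X - pca.inverse_transform(scores)) ** 2).sum(axis=1)
    return t2, q

t2_lim, q_lim = [np.percentile(v, 99) for v in t2_and_q(train)]
t2, q = t2_and_q(np.vstack([rng.normal(size=(5, 12)),
                            rng.normal(size=(1, 12)) + 3]))   # last one faulty
combined = t2 / t2_lim + q / q_lim             # combined index
print("fault flagged:", np.where(combined > 1)[0])   # illustrative cutoff
```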

Journal ArticleDOI
TL;DR: A multivariate extension of the well-known wavelet denoising procedure for scalar-valued signals is proposed, combining a straightforward multivariate generalization of the classical procedure with principal component analysis.
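A sketch in the spirit of the method, assuming the PyWavelets package: rotate the channels into principal axes, wavelet-threshold each rotated signal, and rotate back. The universal threshold and toy signals are illustrative, not the paper's exact estimator.

```python
# Multichannel wavelet denoising in a PCA-rotated basis (illustrative).
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512)
clean = np.stack([np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 5 * t + 1)])
X = clean + 0.3 * rng.normal(size=clean.shape)        # 2 correlated channels

# Rotate channels into principal axes, denoise each, rotate back.
C = np.cov(X)
_, U = np.linalg.eigh(C)
den = []
for y in U.T @ X:
    coeffs = pywt.wavedec(y, "db4", level=5)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # robust noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(y)))          # universal threshold
    coeffs = [coeffs[0]] + [pywt.threshold(c, thr, "soft") for c in coeffs[1:]]
    den.append(pywt.waverec(coeffs, "db4"))
X_hat = U @ np.array(den)
print("noise reduced:", np.linalg.norm(X - clean) > np.linalg.norm(X_hat - clean))
```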

Journal ArticleDOI
TL;DR: A numerical model selection heuristic based on a convex hull is proposed and results show that this heuristic performs almost perfectly, except for Tucker3 data arrays with at least one small mode and a relatively large amount of error.
Abstract: Several three-mode principal component models can be considered for the modelling of three-way, three-mode data, including the Candecomp/Parafac, Tucker3, Tucker2, and Tucker1 models. The following question then may be raised: given a specific data set, which of these models should be selected, and at what complexity (i.e. with how many components)? We address this question by proposing a numerical model selection heuristic based on a convex hull. Simulation results show that this heuristic performs almost perfectly, except for Tucker3 data arrays with at least one small mode and a relatively large amount of error.
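A simplified sketch of a convex-hull selection heuristic: keep (complexity, fit) points on the upper convex boundary and pick the point where the gain-per-complexity ratio drops most sharply. The elbow rule and toy data below are illustrative.

```python
# Convex-hull model selection on (complexity, fit) pairs (illustrative).
import numpy as np

def chull_select(complexity, fit):
    pts = sorted(zip(complexity, fit))
    hull = [pts[0]]
    for p in pts[1:]:                        # upper boundary: drop dominated
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (p[0] - x2) <= (p[1] - y2) * (x2 - x1):
                hull.pop()                   # hull[-1] lies under the chord
            else:
                break
        if p[1] > hull[-1][1]:               # keep only fit-improving points
            hull.append(p)
    # Elbow: largest ratio of successive slopes along the hull.
    slopes = [(hull[i + 1][1] - hull[i][1]) / (hull[i + 1][0] - hull[i][0])
              for i in range(len(hull) - 1)]
    ratios = [slopes[i] / slopes[i + 1] for i in range(len(slopes) - 1)]
    return hull[int(np.argmax(ratios)) + 1]

comp = [1, 2, 3, 4, 5, 6]
fit = [0.50, 0.70, 0.86, 0.88, 0.89, 0.895]   # fit flattens after 3 components
print("selected (complexity, fit):", chull_select(comp, fit))
```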

Journal ArticleDOI
TL;DR: It is proposed to use as covariates of the logistic model a reduced set of optimum principal components of the original predictors, to improve the estimation of the logistic model parameters under multicollinearity and to reduce the dimension of the problem with continuous covariates.
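A minimal sketch of principal-component logistic regression with scikit-learn; choosing components by variance alone is an illustrative simplification of the paper's optimum component selection.

```python
# Principal component logistic regression: PCA scores as decorrelated covariates.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 3))
# 15 strongly collinear covariates built from 3 underlying signals.
X = np.hstack([z + 0.05 * rng.normal(size=(200, 3)) for _ in range(5)])
y = (z[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

pcs = PCA(n_components=3).fit_transform(X)      # decorrelated covariates
model = LogisticRegression().fit(pcs, y)
print("accuracy on training data:", model.score(pcs, y))
```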

Journal ArticleDOI
28 Feb 2006, Talanta
TL;DR: The use of genetic algorithms (GA) for variable selection was found to enhance the classification performance of the PLS-DA models, and various metabolites were identified that are responsible for the observed separations.
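A compact genetic-algorithm sketch for variable selection, with fitness given by cross-validated PLS-DA accuracy (PLS regression on 0/1 labels, thresholded at 0.5). Population size, rates, and the fitness itself are illustrative choices, not the paper's settings.

```python
# GA variable selection with a PLS-DA fitness (illustrative).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40)); y = np.repeat([0, 1], 30)
X[y == 1, :4] += 1.0                         # 4 discriminating "metabolites"

def fitness(mask):
    if mask.sum() < 2:
        return 0.0
    pred = cross_val_predict(PLSRegression(n_components=2), X[:, mask], y, cv=5)
    return ((pred.ravel() > 0.5) == y).mean()

pop = rng.random((20, 40)) < 0.2             # 20 random variable subsets
for gen in range(15):
    fits = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(fits)[-10:]]    # keep the fitter half
    kids = parents[rng.integers(0, 10, 10)].copy()
    cross = rng.random(kids.shape) < 0.5     # uniform crossover
    kids[cross] = parents[rng.integers(0, 10, 10)][cross]
    kids ^= rng.random(kids.shape) < 0.02    # mutation flips a few bits
    pop = np.vstack([parents, kids])
best = pop[np.argmax([fitness(m) for m in pop])]
print("selected variables:", np.where(best)[0])
```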

Journal ArticleDOI
TL;DR: This work discusses potential biases imposed by the application of ANCOVA and residuals analysis for quantifying morphological differences, and elaborate and demonstrate a more effective alternative: common principal components analysis combined with Burnaby’s back-projection method.
Abstract: Morphological relationships change with overall body size and body size often varies among populations. Therefore, quantitative analyses of individual traits from organisms in different populations or environments (e.g., in studies of phenotypic plasticity) often adjust for differences in body size to isolate changes in allometry. Most studies of among population variation in morphology either (1) use analysis of covariance (ANCOVA) with a univariate measure of body size as the covariate, or (2) compare residuals from ordinary least squares regression of each trait against body size or the first principal component of the pooled data (shearing). However, both approaches are problematic. ANCOVA depends on assumptions (small variance in the covariate) that are frequently violated in this context. Residuals analysis assumes that scaling relationships within groups are equal, but this assumption is rarely tested. Furthermore, scaling relationships obtained from pooled data typically mischaracterize within-group scaling relationships. We discuss potential biases imposed by the application of ANCOVA and residuals analysis for quantifying morphological differences, and elaborate and demonstrate a more effective alternative: common principal components analysis combined with Burnaby's back-projection method.
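A sketch of the back-projection step: estimate a common within-group size axis (here simply the first eigenvector of the pooled within-group covariance, a simplification of full common principal components estimation) and project the data onto its orthogonal complement before comparing groups.

```python
# Burnaby-style back-projection for size correction (illustrative).
import numpy as np

rng = np.random.default_rng(0)
size = rng.normal(2, 0.5, 100)
traits = np.outer(size, [1.0, 0.8, 1.2]) + 0.1 * rng.normal(size=(100, 3))
groups = np.repeat([0, 1], 50)
traits[groups == 1] += [0.3, 0.0, -0.3]        # group shape difference

# Pooled within-group covariance gives a common "size" direction g.
W = sum(np.cov(traits[groups == g].T) for g in (0, 1)) / 2
g = np.linalg.eigh(W)[1][:, -1]                # first within-group PC

P = np.eye(3) - np.outer(g, g) / (g @ g)       # Burnaby projection matrix
shape_only = traits @ P.T                      # size removed, shape retained
print("group shape means after back-projection:")
print(shape_only[groups == 0].mean(0), shape_only[groups == 1].mean(0))
```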