
Showing papers on "Principal component analysis published in 2009"


Posted Content
TL;DR: A new online optimization algorithm is proposed, based on stochastic approximations, which scales up gracefully to large data sets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems.
Abstract: Sparse coding--that is, modelling data vectors as sparse linear combinations of basis elements--is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on the large-scale matrix factorization problem that consists of learning the basis set, adapting it to specific data. Variations of this problem include dictionary learning in signal processing, non-negative matrix factorization and sparse principal component analysis. In this paper, we propose to address these tasks with a new online optimization algorithm, based on stochastic approximations, which scales up gracefully to large datasets with millions of training samples, and extends naturally to various matrix factorization formulations, making it suitable for a wide range of learning problems. A proof of convergence is presented, along with experiments with natural images and genomic data demonstrating that it leads to state-of-the-art performance in terms of speed and optimization for both small and large datasets.
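As a rough illustration of the kind of online matrix factorization described above, the sketch below runs scikit-learn's MiniBatchDictionaryLearning (whose online updates belong to this family of stochastic-approximation methods) on random stand-in patches. The patch data, dictionary size and penalty are illustrative assumptions, not the paper's experimental setup, and the parameter names assume a recent scikit-learn release.

```python
# Minimal sketch: online dictionary learning / sparse coding on stand-in "image patches".
# All sizes and penalties are illustrative; this is not the paper's experimental setup.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 64))        # stand-in for 8x8 image patches, one per row
X -= X.mean(axis=1, keepdims=True)           # remove the mean of each patch

dico = MiniBatchDictionaryLearning(
    n_components=100,    # number of dictionary atoms (the learned basis)
    alpha=1.0,           # l1 penalty controlling sparsity of the codes
    batch_size=256,      # samples per online (mini-batch) update
    max_iter=10,         # passes over the data (recent scikit-learn API)
    random_state=0,
)
codes = dico.fit_transform(X)                # sparse codes, one row per patch
D = dico.components_                         # learned dictionary (100 x 64)
print(codes.shape, D.shape, "fraction of nonzero code entries:", (codes != 0).mean())
```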

2,256 citations


Proceedings Article
07 Dec 2009
TL;DR: It is proved that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which it is given a fast and provably convergent algorithm.
Abstract: Principal component analysis is a fundamental operation in computational data analysis, with myriad applications ranging from web search to bioinformatics to computer vision and image analysis. However, its performance and applicability in real scenarios are limited by a lack of robustness to outlying or corrupted observations. This paper considers the idealized "robust principal component analysis" problem of recovering a low rank matrix A from corrupted observations D = A + E. Here, the corrupted entries E are unknown and the errors can be arbitrarily large (modeling grossly corrupted observations common in visual and bioinformatic data), but are assumed to be sparse. We prove that most matrices A can be efficiently and exactly recovered from most error sign-and-support patterns by solving a simple convex program, for which we give a fast and provably convergent algorithm. Our result holds even when the rank of A grows nearly proportionally (up to a logarithmic factor) to the dimensionality of the observation space and the number of errors E grows in proportion to the total number of entries in the matrix. A by-product of our analysis is the first proportional growth results for the related problem of completing a low-rank matrix from a small fraction of its entries. Simulations and real-data examples corroborate the theoretical results, and suggest potential applications in computer vision.
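A minimal numpy sketch of the convex "low-rank plus sparse" recovery described above, using a standard augmented-Lagrangian iteration (singular-value thresholding for A, soft-thresholding for E). It is not the authors' fast algorithm; the weight lam and step mu below follow common heuristics from the robust PCA literature.

```python
import numpy as np

def robust_pca(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Recover low-rank A and sparse E with D = A + E via an ALM-style iteration (sketch)."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))                 # standard weight for the l1 term
    mu = mu or 0.25 * m * n / (np.abs(D).sum() + 1e-12)   # common heuristic step size
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)                                  # Lagrange multiplier
    norm_D = np.linalg.norm(D)
    for _ in range(max_iter):
        # low-rank update: singular-value thresholding
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # sparse update: elementwise soft-thresholding
        R = D - A + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual update and convergence check on the equality constraint
        Y += mu * (D - A - E)
        if np.linalg.norm(D - A - E) <= tol * norm_D:
            break
    return A, E

# toy example: a low-rank matrix with 5% grossly corrupted entries
rng = np.random.default_rng(0)
A0 = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 200))
E0 = np.zeros((200, 200))
idx = rng.random((200, 200)) < 0.05
E0[idx] = 10 * rng.standard_normal(idx.sum())
A_hat, E_hat = robust_pca(A0 + E0)
print("relative recovery error:", np.linalg.norm(A_hat - A0) / np.linalg.norm(A0))
```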

1,479 citations


Book
17 Feb 2009
TL;DR: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables
Abstract: Introduction Chemoinformatics-Chemometrics-Statistics This Book Historical Remarks about Chemometrics Bibliography Starting Examples Univariate Statistics-A Reminder Multivariate Data Definitions Basic Preprocessing Covariance and Correlation Distances and Similarities Multivariate Outlier Identification Linear Latent Variables Summary Principal Component Analysis (PCA) Concepts Number of PCA Components Centering and Scaling Outliers and Data Distribution Robust PCA Algorithms for PCA Evaluation and Diagnostics Complementary Methods for Exploratory Data Analysis Examples Summary Calibration Concepts Performance of Regression Models Ordinary Least Squares Regression Robust Regression Variable Selection Principal Component Regression Partial Least Squares Regression Related Methods Examples Summary Classification Concepts Linear Classification Methods Kernel and Prototype Methods Classification Trees Artificial Neural Networks Support Vector Machine Evaluation Examples Summary Cluster Analysis Concepts Distance and Similarity Measures Partitioning Methods Hierarchical Clustering Methods Fuzzy Clustering Model-Based Clustering Cluster Validity and Clustering Tendency Measures Examples Summary Preprocessing Concepts Smoothing and Differentiation Multiplicative Signal Correction Mass Spectral Features Appendix 1: Symbols and Abbreviations Appendix 2: Matrix Algebra Appendix 3: Introduction to R Index References appear at the end of each chapter

1,003 citations


Journal ArticleDOI
TL;DR: A simple algorithm for selecting a subset of coordinates with largest sample variances is provided, and it is shown that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.
Abstract: Principal components analysis (PCA) is a classic method for the reduction of dimensionality of data in the form of n observations (or cases) of a vector with p variables. Contemporary datasets often have p comparable with or even much larger than n. Our main assertions, in such settings, are (a) that some initial reduction in dimensionality is desirable before applying any PCA-type search for principal modes, and (b) the initial reduction in dimensionality is best achieved by working in a basis in which the signals have a sparse representation. We describe a simple asymptotic model in which the estimate of the leading principal component vector via standard PCA is consistent if and only if p(n)/n → 0. We provide a simple algorithm for selecting a subset of coordinates with largest sample variances, and show that if PCA is done on the selected subset, then consistency is recovered, even if p(n) ≫ n.
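The sketch below illustrates the two-step recipe in this abstract with numpy: keep the coordinates with the largest sample variances, run PCA on that subset, and embed the leading eigenvector back into the full space. The subset size k and the simulated spiked data are illustrative choices, not the paper's selection rule or asymptotic setting.

```python
import numpy as np

def subset_pca_leading_vector(X, k):
    """Select the k coordinates with largest sample variance, do PCA on them,
    and return the leading eigenvector embedded back into all p coordinates."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    keep = np.argsort(Xc.var(axis=0))[-k:]          # coordinates with largest variance
    cov_sub = np.cov(Xc[:, keep], rowvar=False)
    _, V = np.linalg.eigh(cov_sub)
    v = np.zeros(p)
    v[keep] = V[:, -1]                              # leading eigenvector on the subset
    return v

# spiked model with p >> n: the signal lives on a few coordinates
rng = np.random.default_rng(0)
n, p = 100, 2000
u = np.zeros(p); u[:20] = 1 / np.sqrt(20)           # sparse population eigenvector
X = rng.standard_normal((n, 1)) * 3 @ u[None, :] + rng.standard_normal((n, p))
v_hat = subset_pca_leading_vector(X, k=50)
print("overlap with the true direction:", abs(v_hat @ u))
```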

937 citations


Journal ArticleDOI
TL;DR: A novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering is proposed; experimental results confirm the effectiveness of the approach.
Abstract: In this letter, we propose a novel technique for unsupervised change detection in multitemporal satellite images using principal component analysis (PCA) and k-means clustering. The difference image is partitioned into h × h nonoverlapping blocks. S (S ≤ h²) orthonormal eigenvectors are extracted through PCA of the h × h nonoverlapping block set to create an eigenvector space. Each pixel in the difference image is represented by an S-dimensional feature vector, which is the projection of the h × h difference image data onto the generated eigenvector space. Change detection is achieved by partitioning the feature vector space into two clusters using k-means clustering with k = 2 and then assigning each pixel to one of the two clusters using the minimum Euclidean distance between the pixel's feature vector and the mean feature vector of each cluster. Experimental results confirm the effectiveness of the proposed approach.
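A compact numpy/scikit-learn sketch of the pipeline just described: PCA of non-overlapping h × h blocks of the difference image, projection of each pixel's h × h neighbourhood onto the leading S eigenvectors, and k-means with k = 2 on the resulting feature vectors. The block size, S and the synthetic images are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.cluster import KMeans

def pca_kmeans_change_map(img1, img2, h=4, S=3):
    """Unsupervised change map from two co-registered images (sketch of the method above)."""
    diff = np.abs(img1.astype(float) - img2.astype(float))
    H, W = diff.shape
    Hc, Wc = H - H % h, W - W % h                          # crop so blocks tile exactly
    blocks = diff[:Hc, :Wc].reshape(Hc // h, h, Wc // h, h)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, h * h)   # non-overlapping h x h blocks
    mean = blocks.mean(axis=0)
    _, _, Vt = np.linalg.svd(blocks - mean, full_matrices=False)
    eigvecs = Vt[:S]                                        # leading S orthonormal eigenvectors
    # per-pixel h x h neighbourhoods (reflect-padded), projected onto the eigenvector space
    pad = h // 2
    padded = np.pad(diff, pad, mode="reflect")
    windows = sliding_window_view(padded, (h, h))[:H, :W].reshape(H * W, h * h)
    features = (windows - mean) @ eigvecs.T
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
    change_map = labels.reshape(H, W)
    # call the cluster with the larger mean difference "changed"
    if diff[change_map == 0].mean() > diff[change_map == 1].mean():
        change_map = 1 - change_map
    return change_map

rng = np.random.default_rng(0)
img1 = rng.random((64, 64))
img2 = img1.copy(); img2[20:40, 20:40] += 0.8               # synthetic change region
print(pca_kmeans_change_map(img1, img2).sum(), "pixels flagged as changed")
```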

817 citations


Journal ArticleDOI
TL;DR: An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.
Abstract: Principal components analysis and, more generally, the Singular Value Decomposition are fundamental data analysis tools that express a data matrix in terms of a sequence of orthogonal or uncorrelated vectors of decreasing importance. Unfortunately, being linear combinations of up to all the data points, these vectors are notoriously difficult to interpret in terms of the data and processes generating the data. In this article, we develop CUR matrix decompositions for improved data analysis. CUR decompositions are low-rank matrix decompositions that are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. Because they are constructed from actual data elements, CUR decompositions are interpretable by practitioners of the field from which the data are drawn (to the extent that the original data are). We present an algorithm that preferentially chooses columns and rows that exhibit high “statistical leverage” and, thus, in a very precise statistical sense, exert a disproportionately large “influence” on the best low-rank fit of the data matrix. By selecting columns and rows in this manner, we obtain improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work. In addition, since the construction involves computing quantities with a natural and widely studied statistical interpretation, we can leverage ideas from diagnostic regression analysis to employ these matrix decompositions for exploratory data analysis.
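The sketch below shows one common leverage-score CUR construction in the spirit of this article: leverage scores computed from the top-k singular subspace drive a random choice of actual columns and rows, and the middle factor is fit by least squares. The sampling scheme and sizes are illustrative and do not reproduce the paper's specific algorithm or its error guarantees.

```python
import numpy as np

def leverage_scores(A, k, axis):
    """Normalized statistical leverage scores computed from the top-k singular subspace."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    B = Vt[:k].T if axis == "columns" else U[:, :k]
    scores = (B ** 2).sum(axis=1)
    return scores / scores.sum()

def cur(A, k=5, c=20, r=20, rng=None):
    """Pick c columns and r rows of A with probability proportional to leverage,
    then fit U = C^+ A R^+ so that C @ U @ R approximates A."""
    rng = np.random.default_rng(rng)
    cols = rng.choice(A.shape[1], size=c, replace=False, p=leverage_scores(A, k, "columns"))
    rows = rng.choice(A.shape[0], size=r, replace=False, p=leverage_scores(A, k, "rows"))
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

rng = np.random.default_rng(0)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 200))   # roughly rank-8 matrix
C, U, R, cols, rows = cur(A, k=8, c=30, r=30, rng=0)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(f"relative CUR error: {err:.3f}; some columns kept: {sorted(cols)[:5]}")
```

Because C and R are actual columns and rows of the data matrix, they stay interpretable in the units and meaning of the original variables, which is the point the article emphasizes.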

815 citations


Journal ArticleDOI
TL;DR: The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but – on an optimal subset of features – the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only.
Abstract: Regularized regression methods such as principal component or partial least squares regression perform well in learning tasks on high dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for an explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees which are based on orthogonal splits in feature space. We propose to combine the best of both approaches, and evaluate the joint use of a feature selection based on a recursive feature elimination using the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, a feature selection using the Gini feature importance with a regularized classification by discriminant partial least squares regression performed as well as or better than a filtering according to different univariate statistical tests, or using regression coefficients in a backward feature elimination. It outperformed the direct application of the random forest classifier, or the direct application of the regularized classifiers on the full set of features. The Gini importance of the random forest provided superior means for measuring feature relevance on spectral data, but – on an optimal subset of features – the regularized classifiers might be preferable over the random forest classifier, in spite of their limitation to model linear dependencies only. A feature selection based on Gini importance, however, may precede a regularized linear classification to identify this optimal subset of features, and to earn a double benefit of both dimensionality reduction and the elimination of noise from the classification task.
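A hedged scikit-learn sketch of the combination evaluated above: recursive feature elimination ranked by random-forest Gini importance, followed by discriminant PLS (PLS-DA) on the surviving features. The data are a synthetic stand-in for spectra, and the tree count, elimination step and number of PLS components are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

# synthetic "spectra": many correlated features, few of them informative
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           n_redundant=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# recursive feature elimination driven by random-forest Gini importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
selector = RFE(rf, n_features_to_select=30, step=50).fit(X_tr, y_tr)

# PLS-DA: regress the 0/1 class label on the selected features, threshold predictions at 0.5
pls = PLSRegression(n_components=5).fit(X_tr[:, selector.support_], y_tr)
y_pred = (pls.predict(X_te[:, selector.support_]).ravel() > 0.5).astype(int)
print("accuracy with Gini-RFE + PLS-DA:", (y_pred == y_te).mean())
```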

726 citations


Journal ArticleDOI
TL;DR: In this paper, the effects of discreteness of the observed variables on principal component analysis (PCA) are reviewed and the statistical properties of the popular Filmer and Pritchett (2001) procedure are analyzed.
Abstract: The last several years have seen a growth in the number of publications in economics that use principal component analysis (PCA) in the area of welfare studies. This paper explores the ways discrete data can be incorporated into PCA. The effects of discreteness of the observed variables on the PCA are reviewed. The statistical properties of the popular Filmer and Pritchett (2001) procedure are analyzed. The concepts of polychoric and polyserial correlations are introduced with appropriate references to the existing literature demonstrating their statistical properties. A large simulation study is carried out to compare various implementations of discrete data PCA. The simulation results show that the currently used method of running PCA on a set of dummy variables as proposed by Filmer and Pritchett (2001) can be improved upon by using procedures appropriate for discrete data, such as retaining the ordinal variables without breaking them into a set of dummy variables or using polychoric correlations. An empirical example using Bangladesh 2000 Demographic and Health Survey data helps in explaining the differences between procedures.
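The sketch below contrasts two of the options compared in the paper on simulated ordinal asset indicators: PCA of a set of dummy variables (the Filmer and Pritchett style index) versus PCA of the variables kept as ordered codes. A full polychoric-correlation PCA needs a dedicated routine and is not shown; the simulated data and category cut-points are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# simulate ordinal asset indicators driven by one latent "wealth" factor
rng = np.random.default_rng(0)
n, p = 1000, 8
wealth = rng.standard_normal(n)
latent = wealth[:, None] + 0.8 * rng.standard_normal((n, p))
ordinal = np.digitize(latent, [-0.5, 0.5, 1.5])            # 4 ordered categories per variable

# option 1: break each variable into dummies, take the first PC (Filmer-Pritchett style)
dummies = (ordinal[:, :, None] == np.arange(4)).reshape(n, -1).astype(float)
index_fp = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(dummies)).ravel()

# option 2: keep the ordinal coding, take the first PC of the standardized variables
index_ord = PCA(n_components=1).fit_transform(
    StandardScaler().fit_transform(ordinal.astype(float))).ravel()

for name, idx in [("dummy-variable PCA", index_fp), ("ordinal PCA", index_ord)]:
    corr = abs(np.corrcoef(idx, wealth)[0, 1])
    print(f"{name}: |correlation with latent wealth| = {corr:.3f}")
```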

712 citations


Journal ArticleDOI
TL;DR: It is shown that ignoring phylogeny in preliminary transformations can result in significantly elevated variance and type I error in the authors' statistical estimators, even if subsequent analysis of the transformed data is performed using phylogenetic methods.
Abstract: Phylogenetic methods for the analysis of species data are widely used in evolutionary studies. However, preliminary data transformations and data reduction procedures (such as a size-correction and principal components analysis, PCA) are often performed without first correcting for nonindependence among the observations for species. In the present short comment and attached R and MATLAB code, I provide an overview of statistically correct procedures for phylogenetic size-correction and PCA. I also show that ignoring phylogeny in preliminary transformations can result in significantly elevated variance and type I error in our statistical estimators, even if subsequent analysis of the transformed data is performed using phylogenetic methods. This means that ignoring phylogeny during preliminary data transformations can possibly lead to spurious results in phylogenetic statistical analyses of species data.

698 citations


Journal ArticleDOI
01 May 2009 - Oikos
TL;DR: Partial least squares regression (PLSR), as presented in this paper, is a statistical technique particularly well suited to analyzing a large array of related (i.e. not truly independent) predictor variables, to sample sizes that are small relative to the number of predictors, and to cases in which complex phenomena or syndromes must be defined as a combination of several independently obtained variables.
Abstract: This paper briefly presents the aims, requirements and results of partial least squares regression analysis (PLSR), and its potential utility in ecological studies. This statistical technique is particularly well suited to analyzing a large array of related predictor variables (i.e. not truly independent), with a sample size not large enough compared to the number of independent variables, and in cases in which an attempt is made to approach complex phenomena or syndromes that must be defined as a combination of several variables obtained independently. A simulation experiment is carried out to compare this technique with multiple regression (MR) and with a combination of principal component analysis and multiple regression (PCA+MR), varying the number of predictor variables and sample sizes. PLSR models explained a similar amount of variance to those results obtained by MR and PCA+MR. However, PLSR was more reliable than other techniques when identifying relevant variables and their magnitudes of influence, especially in cases of small sample size and low tolerance. Finally, we present one example of PLSR to illustrate its application and interpretation in ecology.
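A brief scikit-learn sketch of the three approaches compared in the simulation experiment, multiple regression (MR), PCA followed by regression (PCA+MR) and PLSR, fit to the same small sample of collinear predictors. The sample size, number of predictors and noise level are arbitrary illustrations, not the paper's simulation design.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# small sample, many correlated predictors, response driven by a few latent factors
rng = np.random.default_rng(1)
n, p = 40, 25
latent = rng.standard_normal((n, 3))
X = latent @ rng.standard_normal((3, p)) + 0.3 * rng.standard_normal((n, p))
y = latent[:, 0] - 2 * latent[:, 1] + 0.5 * rng.standard_normal(n)

models = {
    "MR (multiple regression)": LinearRegression(),
    "PCA + MR": make_pipeline(PCA(n_components=3), LinearRegression()),
    "PLSR": PLSRegression(n_components=3),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean cross-validated R^2 = {r2:.3f}")
```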

Proceedings ArticleDOI
01 Sep 2009
TL;DR: This paper describes a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set, and is shown to outperform state-of-the-art techniques on three varied datasets.
Abstract: Significant research has been devoted to detecting people in images and videos. In this paper we describe a human detection method that augments widely used edge-based features with texture and color information, providing us with a much richer descriptor set. This augmentation results in an extremely high-dimensional feature space (more than 170,000 dimensions). In such high-dimensional spaces, classical machine learning algorithms such as SVMs are nearly intractable with respect to training. Furthermore, the number of training samples is much smaller than the dimensionality of the feature space, by at least an order of magnitude. Finally, the extraction of features from a densely sampled grid structure leads to a high degree of multicollinearity. To circumvent these data characteristics, we employ Partial Least Squares (PLS) analysis, an efficient dimensionality reduction technique, one which preserves significant discriminative information, to project the data onto a much lower dimensional subspace (20 dimensions, reduced from the original 170,000). Our human detection system, employing PLS analysis over the enriched descriptor set, is shown to outperform state-of-the-art techniques on three varied datasets including the popular INRIA pedestrian dataset, the low-resolution gray-scale DaimlerChrysler pedestrian dataset, and the ETHZ pedestrian dataset consisting of full-length videos of crowded scenes.

Journal ArticleDOI
Gil McVean1
TL;DR: For SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes, which provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture.
Abstract: Principal components analysis, PCA, is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's F_ST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.

06 Apr 2009
TL;DR: This work proposes an image denoising method that exploits nonlocal image modeling, principal component analysis (PCA), and local shape-adaptive anisotropic estimation and shows that the proposed method is competitive and outperforms some of the current best denoising methods, especially in preserving image details and introducing very few artifacts.
Abstract: We propose an image denoising method that exploits nonlocal image modeling, principal component analysis (PCA), and local shape-adaptive anisotropic estimation. The nonlocal modeling is exploited by grouping similar image patches in 3-D groups. The denoising is performed by shrinkage of the spectrum of a 3-D transform applied on such groups. The effectiveness of the shrinkage depends on the ability of the transform to sparsely represent the true-image data, thus separating it from the noise. We propose to improve the sparsity in two aspects. First, we employ image patches (neighborhoods) which can have data-adaptive shape. Second, we propose PCA on these adaptive-shape neighborhoods as part of the employed 3-D transform. The PCA bases are obtained by eigenvalue decomposition of empirical second-moment matrices that are estimated from groups of similar adaptive-shape neighborhoods. We show that the proposed method is competitive and outperforms some of the current best denoising methods, especially in preserving image details and introducing very few artifacts.

Journal ArticleDOI
TL;DR: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns that can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and achieves higher accuracy over larger datasets.
Abstract: This paper presents a generic and patient-specific classification system designed for robust and accurate detection of ECG heartbeat patterns. The proposed feature extraction process utilizes morphological wavelet transform features, which are projected onto a lower dimensional feature space using principal component analysis, and temporal features from the ECG data. For the pattern recognition unit, feedforward and fully connected artificial neural networks, which are optimally designed for each patient by the proposed multidimensional particle swarm optimization technique, are employed. By using relatively small common and patient-specific training data, the proposed classification system can adapt to significant interpatient variations in ECG patterns by training the optimal network structure, and thus, achieves higher accuracy over larger datasets. The classification experiments over a benchmark database demonstrate that the proposed system achieves such average accuracies and sensitivities better than most of the current state-of-the-art algorithms for detection of ventricular ectopic beats (VEBs) and supra-VEBs (SVEBs). Over the entire database, the average accuracy-sensitivity performances of the proposed system for VEB and SVEB detections are 98.3%-84.6% and 97.4%-63.5%, respectively. Finally, due to its parameter-invariant nature, the proposed system is highly generic, and thus, applicable to any ECG dataset.

Book
28 Sep 2009
TL;DR: This book presents chemometric methods for pattern recognition, covering exploratory data analysis, preprocessing, two-class, one-class and multiclass classifiers, validation and optimization, and the determination of potential discriminatory variables, illustrated with case studies ranging from forensic analysis of banknotes to metabolic profiling of mouse urine.
Abstract: Acknowledgements. Preface. 1 Introduction. 1.1 Past, Present and Future. 1.2 About this Book. Bibliography. 2 Case Studies. 2.1 Introduction. 2.2 Datasets, Matrices and Vectors. 2.3 Case Study 1: Forensic Analysis of Banknotes. 2.4 Case Study 2: Near Infrared Spectroscopic Analysis of Food. 2.5 Case Study 3: Thermal Analysis of Polymers. 2.6 Case Study 4: Environmental Pollution using Headspace Mass Spectrometry. 2.7 Case Study 5: Human Sweat Analysed by Gas Chromatography Mass Spectrometry. 2.8 Case Study 6: Liquid Chromatography Mass Spectrometry of Pharmaceutical Tablets. 2.9 Case Study 7: Atomic Spectroscopy for the Study of Hypertension. 2.10 Case Study 8: Metabolic Profiling of Mouse Urine by Gas Chromatography of Urine Extracts. 2.11 Case Study 9: Nuclear Magnetic Resonance Spectroscopy for Salival Analysis of the Effect of Mouthwash. 2.12 Case Study 10: Simulations. 2.13 Case Study 11: Null Dataset. 2.14 Case Study 12: GCMS and Microbiology of Mouse Scent Marks. Bibliography. 3 Exploratory Data Analysis. 3.1 Introduction. 3.2 Principal Components Analysis. 3.2.1 Background. 3.2.2 Scores and Loadings. 3.2.3 Eigenvalues. 3.2.4 PCA Algorithm. 3.2.5 Graphical Representation. 3.3 Dissimilarity Indices, Principal Co-ordinates Analysis and Ranking. 3.3.1 Dissimilarity. 3.3.2 Principal Co-ordinates Analysis. 3.3.3 Ranking. 3.4 Self Organizing Maps. 3.4.1 Background. 3.4.2 SOM Algorithm. 3.4.3 Initialization. 3.4.4 Training. 3.4.5 Map Quality. 3.4.6 Visualization. Bibliography. 4 Preprocessing. 4.1 Introduction. 4.2 Data Scaling. 4.2.1 Transforming Individual Elements. 4.2.2 Row Scaling. 4.2.3 Column Scaling. 4.3 Multivariate Methods of Data Reduction. 4.3.1 Largest Principal Components. 4.3.2 Discriminatory Principal Components. 4.3.3 Partial Least Squares Discriminatory Analysis Scores. 4.4 Strategies for Data Preprocessing. 4.4.1 Flow Charts. 4.4.2 Level 1. 4.4.3 Level 2. 4.4.4 Level 3. 4.4.5 Level 4. Bibliography. 5 Two Class Classifiers. 5.1 Introduction. 5.1.1 Two Class Classifiers. 5.1.2 Preprocessing. 5.1.3 Notation. 5.1.4 Autoprediction and Class Boundaries. 5.2 Euclidean Distance to Centroids. 5.3 Linear Discriminant Analysis. 5.4 Quadratic Discriminant Analysis. 5.5 Partial Least Squares Discriminant Analysis. 5.5.1 PLS Method. 5.5.2 PLS Algorithm. 5.5.3 PLS-DA. 5.6 Learning Vector Quantization. 5.6.1 Voronoi Tesselation and Codebooks. 5.6.2 LVQ1. 5.6.3 LVQ3. 5.6.4 LVQ Illustration and Summary of Parameters. 5.7 Support Vector Machines. 5.7.1 Linear Learning Machines. 5.7.2 Kernels. 5.7.3 Controlling Complexity and Soft Margin SVMs. 5.7.4 SVM Parameters. Bibliography. 6 One Class Classifiers. 6.1 Introduction. 6.2 Distance Based Classifiers. 6.3 PC Based Models and SIMCA. 6.4 Indicators of Significance. 6.4.1 Gaussian Density Estimators and Chi-Squared. 6.4.2 Hotelling's T 2 . 6.4.3 D-Statistic. 6.4.4 Q-Statistic or Squared Prediction Error. 6.4.5 Visualization of D- and Q-Statistics for Disjoint PC Models. 6.4.6 Multivariate Normality and What to do if it Fails. 6.5 Support Vector Data Description. 6.6 Summarizing One Class Classifiers. 6.6.1 Class Membership Plots. 6.6.2 ROC Curves. Bibliography. 7 Multiclass Classifiers. 7.1 Introduction. 7.2 EDC, LDA and QDA. 7.3 LVQ. 7.4 PLS. 7.4.1 PLS2. 7.4.2 PLS1. 7.5 SVM. 7.6 One against One Decisions. Bibliography. 8 Validation and Optimization. 8.1 Introduction. 8.1.1 Validation. 8.1.2 Optimization. 8.2 Classification Abilities, Contingency Tables and Related Concepts. 8.2.1 Two Class Classifiers. 8.2.2 Multiclass Classifiers. 
8.2.3 One Class Classifiers. 8.3 Validation. 8.3.1 Testing Models. 8.3.2 Test and Training Sets. 8.3.3 Predictions. 8.3.4 Increasing the Number of Variables for the Classifier. 8.4 Iterative Approaches for Validation. 8.4.1 Predictive Ability, Model Stability, Classification by Majority Vote and Cross Classification Rate. 8.4.2 Number of Iterations. 8.4.3 Test and Training Set Boundaries. 8.5 Optimizing PLS Models. 8.5.1 Number of Components: Cross-Validation and Bootstrap. 8.5.2 Thresholds and ROC Curves. 8.6 Optimizing Learning Vector Quantization Models. 8.7 Optimizing Support Vector Machine Models. Bibliography. 9 Determining Potential Discriminatory Variables. 9.1 Introduction. 9.1.1 Two Class Distributions. 9.1.2 Multiclass Distributions. 9.1.3 Multilevel and Multiway Distributions. 9.1.4 Sample Sizes. 9.1.5 Modelling after Variable Reduction. 9.1.6 Preliminary Variable Reduction. 9.2 Which Variables are most Significant?. 9.2.1 Basic Concepts: Statistical Indicators and Rank. 9.2.2 T-Statistic and Fisher Weights. 9.2.3 Multiple Linear Regression, ANOVA and the F-Ratio. 9.2.4 Partial Least Squares. 9.2.5 Relationship between the Indicator Functions. 9.3 How Many Variables are Significant? 9.3.1 Probabilistic Approaches. 9.3.2 Empirical Methods: Monte Carlo. 9.3.3 Cost/Benefit of Increasing the Number of Variables. Bibliography. 10 Bayesian Methods and Unequal Class Sizes. 10.1 Introduction. 10.2 Contingency Tables and Bayes' Theorem. 10.3 Bayesian Extensions to Classifiers. Bibliography. 11 Class Separation Indices. 11.1 Introduction. 11.2 Davies Bouldin Index. 11.3 Silhouette Width and Modified Silhouette Width. 11.3.1 Silhouette Width. 11.3.2 Modified Silhouette Width. 11.4 Overlap Coefficient. Bibliography. 12 Comparing Different Patterns. 12.1 Introduction. 12.2 Correlation Based Methods. 12.2.1 Mantel Test. 12.2.2 R V Coefficient. 12.3 Consensus PCA. 12.4 Procrustes Analysis. Bibliography. Index.

Journal ArticleDOI
TL;DR: It turns out that the procedure using ilr-transformed data and robust PCA delivers superior results to all other approaches, demonstrating that, due to the compositional nature of geochemical data, PCA should not be carried out without an appropriate transformation.
Abstract: Compositional data (almost all data in geochemistry) are closed data, that is they usually sum up to a constant (e.g. weight percent, wt.%) and carry only relative information. Thus, the covariance structure of compositional data is strongly biased and results of many multivariate techniques become doubtful without a proper transformation of the data. The centred logratio transformation (clr) is often used to open closed data. However, the transformed data do not have full rank following a logratio transformation and cannot be used for robust multivariate techniques like principal component analysis (PCA). Here we propose to use the isometric logratio transformation (ilr) instead. However, the ilr transformation has the disadvantage that the resulting new variables are no longer directly interpretable in terms of the originally entered variables. Here we propose a technique by which the scores and loadings resulting from a robust PCA on ilr-transformed data can be back-transformed and interpreted. The procedure is demonstrated using a real data set from regional geochemistry and compared to results from non-transformed and non-robust versions of PCA. It turns out that the procedure using ilr-transformed data and robust PCA delivers superior results to all other approaches. The examples demonstrate that due to the compositional nature of geochemical data PCA should not be carried out without an appropriate transformation. Furthermore a robust approach is preferable if the dataset contains outliers.
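A numpy sketch of the workflow described above, with classical (non-robust) PCA standing in for the robust PCA used in the paper: ilr-transform the closed compositions using an orthonormal basis, run PCA on the full-rank coordinates, and map the loadings back to clr space so they can be read in terms of the original parts. The basis construction is one standard choice, and the compositional data are simulated.

```python
import numpy as np

def ilr_basis(D):
    """Orthonormal D x (D-1) basis of the clr hyperplane (one standard Helmert-type choice)."""
    V = np.zeros((D, D - 1))
    for i in range(1, D):
        V[:i, i - 1] = 1.0 / i
        V[i, i - 1] = -1.0
        V[:, i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(X):
    logX = np.log(X)
    clr = logX - logX.mean(axis=1, keepdims=True)     # centred logratio coordinates
    return clr @ ilr_basis(X.shape[1])

# toy compositional data: 5 parts, rows closed to 100 (e.g. wt.%)
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(300, 5)) * [5, 1, 3, 0.5, 2]
comp = 100 * raw / raw.sum(axis=1, keepdims=True)

Z = ilr(comp)                                         # full-rank ilr coordinates
Zc = Z - Z.mean(axis=0)
_, s, Wt = np.linalg.svd(Zc, full_matrices=False)     # classical PCA (the paper uses robust PCA)
scores = Zc @ Wt.T                                    # PCA scores in ilr space
clr_loadings = ilr_basis(comp.shape[1]) @ Wt.T        # loadings back-transformed to clr space
print("explained variance ratio:", (s**2 / (s**2).sum()).round(3))
print("clr loadings of PC1:", clr_loadings[:, 0].round(3), "scores shape:", scores.shape)
```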

Journal ArticleDOI
TL;DR: In this article, the authors describe an efficient algorithm for low-rank approximation of matrices that produces accuracy that is very close to the best possible accuracy, for matrices of arbitrary sizes.
Abstract: Principal component analysis (PCA) requires the computation of a low-rank approximation to a matrix containing the data being analyzed. In many applications of PCA, the best possible accuracy of any rank-deficient approximation is at most a few digits (measured in the spectral norm, relative to the spectral norm of the matrix being approximated). In such circumstances, efficient algorithms have not come with guarantees of good accuracy, unless one or both dimensions of the matrix being approximated are small. We describe an efficient algorithm for the low-rank approximation of matrices that produces accuracy that is very close to the best possible accuracy, for matrices of arbitrary sizes. We illustrate our theoretical results via several numerical examples.
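The sketch below implements a generic randomized low-rank approximation (random test matrix, a few power iterations, then a small SVD) in the spirit of the algorithms discussed in this paper. Oversampling and iteration counts are illustrative; the paper's specific scheme and its accuracy analysis are not reproduced here.

```python
import numpy as np

def randomized_low_rank(A, k, oversample=10, n_power_iter=2, rng=None):
    """Rank-k approximation U @ diag(s) @ Vt of A via a randomized range finder."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    G = rng.standard_normal((n, k + oversample))      # random test matrix
    Y = A @ G
    for _ in range(n_power_iter):                     # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                            # orthonormal basis for (approx.) range of A
    B = Q.T @ A                                       # small (k+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 800)) \
    + 0.01 * rng.standard_normal((1000, 800))
U, s, Vt = randomized_low_rank(A, k=30)
err = np.linalg.norm(A - (U * s) @ Vt, 2) / np.linalg.norm(A, 2)
print(f"relative spectral-norm error of the rank-30 approximation: {err:.2e}")
```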

Journal ArticleDOI
TL;DR: A total projection to latent structures (T-PLS) algorithm is proposed that divides the X-space into four parts instead of the two parts of standard PLS.
Abstract: Partial least squares, or projection to latent structures (PLS), has been used in multivariate statistical process monitoring in a manner similar to principal component analysis. Standard PLS often requires many components or latent variables (LVs), which contain variations orthogonal to Y and useless for predicting Y. Further, the X-residual of PLS usually has quite large variations and thus is not proper to monitor with the Q-statistic. To reduce false alarm and missing alarm rates for faults related to Y, a total projection to latent structures (T-PLS) algorithm is proposed in this article. The new structure divides the X-space into four parts instead of the two parts of standard PLS. The properties of T-PLS are studied in detail, including its relationship to orthogonal PLS. Further study shows the space decomposition on the X-space induced by T-PLS. A fault detection policy is developed based on the T-PLS. Case studies on two simulation examples show the effectiveness of the T-PLS based fault detection methods.

Journal ArticleDOI
TL;DR: Fifteen physicochemical descriptors of side chains of the 20 natural and of 26 non-coded amino acids are compiled and simple methods for their evaluation described and it is shown that three principal components account for 66% of the total variance in the available set.
Abstract: Fifteen physicochemical descriptors of side chains of the 20 natural and of 26 non-coded amino acids are compiled and simple methods for their evaluation described. The relevance of these parameters to account for hydrophobic, steric, and electric properties of the side chains is assessed and their intercorrelation analyzed. It is shown that three principal components, one steric, one bulk, and one electric (electronic), account for 66% of the total variance in the available set. These parameters may prove to be useful for correlation studies in series of bioactive peptide analogues.

Journal ArticleDOI
TL;DR: A new reliable method called probabilistic principal component analysis (PPCA) is put forward to impute the missing flow volume data based on historical data mining to reduce the root-mean-square imputation error by at least 25%.
Abstract: The missing data problem greatly affects traffic analysis. In this paper, we put forward a new reliable method called probabilistic principal component analysis (PPCA) to impute the missing flow volume data based on historical data mining. First, we review the current missing data-imputation method and why it may fail to yield acceptable results in many traffic flow applications. Second, we examine the statistical properties of traffic flow volume time series. We show that the fluctuations of traffic flow are Gaussian type and that principal component analysis (PCA) can be used to retrieve the features of traffic flow. Third, we discuss how to use a robust PCA to filter out the abnormal traffic flow data that disturb the imputation process. Finally, we recall the theories of PPCA/Bayesian PCA-based imputation algorithms and compare their performance with some conventional methods, including the nearest/mean historical imputation methods and the local interpolation/regression methods. The experiments prove that the PPCA method provides significantly better performance than the conventional methods, reducing the root-mean-square imputation error by at least 25%.
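A simplified numpy sketch of PCA-based imputation in the spirit of this paper: missing entries are initialized with column means and then refined by alternating between a low-rank PCA fit and reconstruction. This EM-like iteration is not the authors' PPCA/Bayesian PCA formulation, and the synthetic "flow profile" data, rank and iteration count are assumptions.

```python
import numpy as np

def pca_impute(X, rank=3, n_iter=50):
    """Fill NaNs in X by alternating a rank-`rank` PCA fit with reconstruction of missing cells."""
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])   # initialize with column means
    for _ in range(n_iter):
        mean = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
        recon = mean + (U[:, :rank] * s[:rank]) @ Vt[:rank]  # low-rank reconstruction
        X[missing] = recon[missing]                          # update only the missing cells
    return X

# synthetic "daily flow profiles": low-rank structure plus noise, 20% of entries missing
rng = np.random.default_rng(0)
true = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 96)) \
       + 0.1 * rng.standard_normal((200, 96))
obs = true.copy()
mask = rng.random(true.shape) < 0.2
obs[mask] = np.nan
imputed = pca_impute(obs, rank=4)
rmse = np.sqrt(np.mean((imputed[mask] - true[mask]) ** 2))
print(f"RMS imputation error on the missing entries: {rmse:.3f}")
```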

Journal ArticleDOI
TL;DR: It is shown that the proposed algorithm outperformed PCA and DPCA both in terms of detection and diagnosis of faults.

Journal ArticleDOI
TL;DR: It is shown that temporal basis functions calculated by subjecting the training data to principal component analysis (PCA) can be used to constrain the reconstruction such that the temporal resolution is improved.
Abstract: The k-t broad-use linear acquisition speed-up technique (BLAST) has become widespread for reducing image acquisition time in dynamic MRI. In its basic form k-t BLAST speeds up the data acquisition by undersampling k-space over time (referred to as k-t space). The resulting aliasing is resolved in the Fourier reciprocal x-f space (x = spatial position, f = temporal frequency) using an adaptive filter derived from a low-resolution estimate of the signal covariance. However, this filtering process tends to increase the reconstruction error or lower the achievable acceleration factor. This is problematic in applications exhibiting a broad range of temporal frequencies such as free-breathing myocardial perfusion imaging. We show that temporal basis functions calculated by subjecting the training data to principal component analysis (PCA) can be used to constrain the reconstruction such that the temporal resolution is improved. The presented method is called k-t PCA.

Journal ArticleDOI
TL;DR: In this paper, principal component analysis and soft independent modeling of class analogy are employed to generate models that predict the rock type of the samples; these models appear to exploit the matrix effects associated with the chemistries of the 18 samples.

Journal ArticleDOI
TL;DR: Experimental results presented in this paper confirm the usefulness of the KPCA for the analysis of hyperspectral data and improve results in terms of accuracy.
Abstract: Kernel principal component analysis (KPCA) is investigated for feature extraction from hyperspectral remote sensing data. Features extracted using KPCA are classified using linear support vector machines. In one experiment, it is shown that kernel principal component features are more linearly separable than features extracted with conventional principal component analysis. In a second experiment, kernel principal components are used to construct the extended morphological profile (EMP). Classification results, in terms of accuracy, are improved in comparison to the original approach, which used conventional principal component analysis to construct the EMP. Experimental results presented in this paper confirm the usefulness of KPCA for the analysis of hyperspectral data. For one data set, the overall classification accuracy increases from 79% to 96% with the proposed approach.
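A small scikit-learn sketch of the first experiment described above: features extracted with kernel PCA versus ordinary PCA, each followed by a linear SVM. A synthetic nonlinear two-class problem stands in for the hyperspectral pixels, and the kernel, gamma and component counts are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# nonlinearly separable stand-in for pixel features with two-class structure
X, y = make_circles(n_samples=600, factor=0.3, noise=0.05, random_state=0)

pipelines = {
    "PCA  + linear SVM": make_pipeline(PCA(n_components=2), LinearSVC()),
    "KPCA + linear SVM": make_pipeline(KernelPCA(n_components=2, kernel="rbf", gamma=10.0),
                                       LinearSVC()),
}
for name, pipe in pipelines.items():
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

On this toy problem the kernel principal components are (nearly) linearly separable while the ordinary principal components are not, which is the behaviour the abstract reports for the hyperspectral data.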

Proceedings ArticleDOI
04 Dec 2009
TL;DR: A high-accuracy human activity recognition system based on a single tri-axis accelerometer for use in a naturalistic environment, exploiting the discrete cosine transform, principal component analysis (PCA), and support vector machines to classify different human activities.
Abstract: This paper develops a high-accuracy human activity recognition system based on a single tri-axis accelerometer for use in a naturalistic environment. The system exploits the discrete cosine transform (DCT), principal component analysis (PCA), and support vector machines (SVM) to classify different human activities. First, effective features are extracted from the accelerometer data using the DCT. Next, the feature dimension is reduced by PCA in the DCT domain; after applying PCA, the most invariant and discriminating information for recognition is retained. Multi-class support vector machines are then adopted to distinguish the different human activities. Experimental results show that the proposed system achieves a best accuracy of 97.51%, which is better than other approaches.
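The sketch below mirrors the processing chain described above on synthetic tri-axial accelerometer windows: DCT of each axis, PCA in the DCT domain, then a multi-class SVM. The synthetic signals, window length, number of retained DCT coefficients and PCA components are all assumptions for illustration.

```python
import numpy as np
from scipy.fft import dct
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def synth_window(activity, n=128):
    """Crude stand-in for one tri-axial accelerometer window of a given activity."""
    t = np.arange(n)
    freq, amp = {"still": (0.0, 0.0), "walk": (0.05, 1.0), "run": (0.12, 2.5)}[activity]
    return np.stack([amp * np.sin(2 * np.pi * freq * t + ph) + 0.3 * rng.standard_normal(n)
                     for ph in (0.0, 1.0, 2.0)])          # x, y, z axes

activities = ["still", "walk", "run"]
X, y = [], []
for label, act in enumerate(activities):
    for _ in range(100):
        w = synth_window(act)
        X.append(dct(w, axis=1, norm="ortho")[:, :32].ravel())   # first 32 DCT coefficients per axis
        y.append(label)
X, y = np.array(X), np.array(y)

# PCA reduces the DCT features, then a multi-class SVM separates the activities
model = make_pipeline(PCA(n_components=10), SVC(kernel="rbf", C=10))
print("mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))
```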

Journal ArticleDOI
TL;DR: In this paper, the authors consider a spiked covariance model in which a base matrix is perturbed by adding a k-sparse maximal eigenvector, and analyze two computationally tractable methods for recovering the support set of this maximal eigenvector: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size θ_dia(n, p, k) = n/[k² log(p−k)]; and (b) a more sophisticated semidefinite programming (SDP) relaxation.
Abstract: Principal component analysis (PCA) is a classical method for dimensionality reduction based on extracting the dominant eigenvectors of the sample covariance matrix. However, PCA is well known to behave poorly in the “large p, small n” setting, in which the problem dimension p is comparable to or larger than the sample size n. This paper studies PCA in this high-dimensional regime, but under the additional assumption that the maximal eigenvector is sparse, say, with at most k nonzero components. We consider a spiked covariance model in which a base matrix is perturbed by adding a k-sparse maximal eigenvector, and we analyze two computationally tractable methods for recovering the support set of this maximal eigenvector, as follows: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size θ_dia(n, p, k) = n/[k² log(p−k)]; and (b) a more sophisticated semidefinite programming (SDP) relaxation, which succeeds once the rescaled sample size θ_sdp(n, p, k) = n/[k log(p−k)] is larger than a critical threshold. In addition, we prove that no method, including the best method which has exponential-time complexity, can succeed in recovering the support if the order parameter θ_sdp(n, p, k) is below a threshold. Our results thus highlight an interesting trade-off between computational and statistical efficiency in high-dimensional inference.
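A small numpy simulation of the simpler of the two methods analysed here, diagonal thresholding: under a spiked covariance model with a k-sparse leading eigenvector, the support is estimated as the k coordinates with the largest sample variances. The SDP relaxation is not shown, and the problem sizes and spike strength are illustrative.

```python
import numpy as np

def diagonal_thresholding_support(X, k):
    """Estimate the support of a k-sparse leading eigenvector from the k largest sample variances."""
    variances = (X - X.mean(axis=0)).var(axis=0)
    return set(np.argsort(variances)[-k:])

# spiked covariance: identity plus beta * z z^T with a k-sparse unit vector z
rng = np.random.default_rng(0)
p, k, beta = 500, 10, 4.0
support = set(range(k))
z = np.zeros(p); z[list(support)] = 1 / np.sqrt(k)

for n in (50, 200, 1000):
    X = rng.standard_normal((n, 1)) * np.sqrt(beta) @ z[None, :] + rng.standard_normal((n, p))
    hit = diagonal_thresholding_support(X, k) == support
    print(f"n = {n:4d}: exact support recovery = {hit}")
```

Re-running this for a grid of (n, p, k) values and plotting success frequency against n/[k² log(p−k)] is one way to see the success-to-failure transition the abstract describes.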

Journal ArticleDOI
TL;DR: A matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit of finite sample PCA are presented.
Abstract: Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of $n$ observations (samples), each with $p$ variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size $n$, and those of the limiting population PCA as $n\to\infty$. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit $p,n\to\infty$, with $p/n=c$. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite $p,n$ where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size $n$, the eigenvector of sample PCA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.
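A brief numpy simulation of the finite-sample behaviour discussed above: under a rank-one spiked covariance model with p/n fixed, the overlap between the leading sample and population eigenvectors collapses once the spike strength falls below roughly sqrt(p/n). The closed-form "asymptotic prediction" printed below is the standard spiked-model limit from this literature, quoted for comparison; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 500                     # c = p/n = 1, so the critical spike strength is sqrt(c) = 1
v = np.zeros(p); v[0] = 1.0         # population leading eigenvector

for beta in (0.5, 1.0, 2.0, 4.0):   # spike strengths below, at and above the threshold
    # samples from N(0, I + beta * v v^T)
    X = rng.standard_normal((n, p)) + np.sqrt(beta) * rng.standard_normal((n, 1)) * v[None, :]
    cov = X.T @ X / n
    _, V = np.linalg.eigh(cov)
    overlap = abs(V[:, -1] @ v)     # |<leading sample eigenvector, population eigenvector>|
    predicted = np.sqrt(max(0.0, (1 - (p / n) / beta**2) / (1 + (p / n) / beta)))
    print(f"beta = {beta:3.1f}: sample overlap = {overlap:.2f}, asymptotic prediction = {predicted:.2f}")
```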


Journal ArticleDOI
TL;DR: This work investigates the asymptotic behavior of the Principal Component (PC) directions and shows that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most otherPC directions are strongly inconsistent.
Abstract: Principal Component Analysis (PCA) is an important tool of dimension reduction especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed, and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which includes a $\rho$-mixing condition and a broad range of sphericity measures of the covariance matrix.