
Showing papers on "Principal component analysis" published in 2007


Book
03 May 2007
TL;DR: This book introduces exploratory and modelling techniques for ecological data, illustrated with case studies ranging from mixed modelling of the effects of rice farming on aquatic birds to univariate methods for analysing the abundance of decapod larvae.
Abstract: Introduction.- Data management and software.- Advice for teachers.- Exploration.- Linear regression.- Generalised linear modelling.- Additive and generalised additive modelling.- Introduction to mixed modelling.- Univariate tree models.- Measures of association.- Ordination--first encounter.- Principal component analysis and redundancy analysis.- Correspondence analysis and canonical correspondence analysis.- Introduction to discriminant analysis.- Principal coordinate analysis and non-metric multidimensional scaling.- Time series analysis--Introduction.- Common trends and sudden changes.- Analysis and modelling lattice data.- Spatially continuous data analysis and modelling.- Univariate methods to analyse abundance of decapod larvae.- Analysing presence and absence data for flatfish distribution in the Tagus estuary, Portugal.- Crop pollination by honeybees in an Argentinean pampas system using additive mixed modelling.- Investigating the effects of rice farming on aquatic birds with mixed modelling.- Classification trees and radar detection of birds for North Sea wind farms.- Fish stock identification through neural network analysis of parasite fauna.- Monitoring for change: using generalised least squares, nonmetric multidimensional scaling, and the Mantel test on western Montana grasslands.- Univariate and multivariate analysis applied on a Dutch sandy beach community.- Multivariate analyses of South-American zoobenthic species--spoilt for choice.- Principal component analysis applied to harbour porpoise fatty acid data.- Multivariate analysis of morphometric turtle data--size and shape.- Redundancy analysis and additive modelling applied on savanna tree data.- Canonical correspondence analysis of lowland pasture vegetation in the humid tropics of Mexico.- Estimating common trends in Portuguese fisheries landings.- Common trends in demersal communities on the Newfoundland-Labrador Shelf.- Sea level change and salt marshes in the Wadden Sea: a time series analysis.- Time series analysis of Hawaiian waterbirds.- Spatial modelling of forest community features in the Volzhsko-Kamsky reserve.

1,788 citations


Journal ArticleDOI
TL;DR: This study illustrates the usefulness of multivariate statistical techniques for analysis and interpretation of complex data sets, and in water quality assessment, identification of pollution sources/factors and understanding temporal/spatial variations in water quality for effective river water quality management.
Abstract: Multivariate statistical techniques, such as cluster analysis (CA), principal component analysis (PCA), factor analysis (FA) and discriminant analysis (DA), were applied for the evaluation of temporal/spatial variations and the interpretation of a large complex water quality data set of the Fuji river basin, generated during 8 years (1995–2002) monitoring of 12 parameters at 13 different sites (14 976 observations). Hierarchical cluster analysis grouped 13 sampling sites into three clusters, i.e., relatively less polluted (LP), medium polluted (MP) and highly polluted (HP) sites, based on the similarity of water quality characteristics. Factor analysis/principal component analysis, applied to the data sets of the three different groups obtained from cluster analysis, resulted in five, five and three latent factors explaining 73.18, 77.61 and 65.39% of the total variance in water quality data sets of LP, MP and HP areas, respectively. The varifactors obtained from factor analysis indicate that the parameters responsible for water quality variations are mainly related to discharge and temperature (natural), organic pollution (point source: domestic wastewater) in relatively less polluted areas; organic pollution (point source: domestic wastewater) and nutrients (non-point sources: agriculture and orchard plantations) in medium polluted areas; and organic pollution and nutrients (point sources: domestic wastewater, wastewater treatment plants and industries) in highly polluted areas in the basin. Discriminant analysis gave the best results for both spatial and temporal analysis. It provided an important data reduction as it uses only six parameters (discharge, temperature, dissolved oxygen, biochemical oxygen demand, electrical conductivity and nitrate nitrogen), affording more than 85% correct assignations in temporal analysis, and seven parameters (discharge, temperature, biochemical oxygen demand, pH, electrical conductivity, nitrate nitrogen and ammonical nitrogen), affording more than 81% correct assignations in spatial analysis, of three different sampling sites of the basin. Therefore, DA allowed a reduction in the dimensionality of the large data set, delineating a few indicator parameters responsible for large variations in water quality. Thus, this study illustrates the usefulness of multivariate statistical techniques for analysis and interpretation of complex data sets, and in water quality assessment, identification of pollution sources/factors and understanding temporal/spatial variations in water quality for effective river water quality management.
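
A rough illustration of the workflow described above (hierarchical clustering of the sites followed by PCA/factor extraction within each pollution group) is sketched below in Python on synthetic placeholder data; the array shapes, the site ordering and the five retained components are assumptions made for illustration, not the study's data or code.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(130, 12))          # placeholder: observations x 12 monitored parameters
Xz = StandardScaler().fit_transform(X)  # z-scores, as is usual before CA/PCA

# Hierarchical cluster analysis of sampling sites (site means, Ward linkage),
# assuming the rows are ordered site by site (13 hypothetical sites, 10 rows each).
site_means = Xz.reshape(13, 10, 12).mean(axis=1)
clusters = fcluster(linkage(site_means, method="ward"), t=3, criterion="maxclust")

# PCA / factor extraction within each pollution group, reporting how much variance
# the leading components explain (the paper retains three to five varifactors per group).
for g in np.unique(clusters):
    Xg = Xz[np.repeat(clusters == g, 10)]
    pca = PCA(n_components=5).fit(Xg)
    print(g, pca.explained_variance_ratio_.cumsum())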

1,481 citations


Journal ArticleDOI
TL;DR: This study introduces and investigates the use of kernel PCA for novelty detection and demonstrated a competitive performance on two-dimensional synthetic distributions and on two real-world data sets: handwritten digits and breast-cancer cytology.

727 citations


Journal ArticleDOI
TL;DR: In this paper, an alternative approach based on quadratic regularisation is suggested and shown to have advantages from some points of view, and it is shown that optimal convergence rates are achieved by the PCA technique in certain circumstances.
Abstract: In functional linear regression, the slope "parameter" is a function. Therefore, in a nonparametric context, it is determined by an infinite number of unknowns. Its estimation involves solving an ill-posed problem and has points of contact with a range of methodologies, including statistical smoothing and deconvolution. The standard approach to estimating the slope function is based explicitly on functional principal components analysis and, consequently, on spectral decomposition in terms of eigenvalues and eigenfunctions. We discuss this approach in detail and show that in certain circumstances, optimal convergence rates are achieved by the PCA technique. An alternative approach based on quadratic regularisation is suggested and shown to have advantages from some points of view.
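
For concreteness, the display below writes out the functional linear model and the principal-components (spectral truncation) estimator of the slope function that the abstract refers to; the notation is a conventional choice made here for illustration rather than the paper's exact symbols.

% Functional linear model with scalar response Y_i and functional predictor X_i
\[
  Y_i = a + \int_0^1 b(t)\, X_i(t)\, dt + \varepsilon_i .
\]
% PCA-based estimator: expand b in the eigenfunctions \hat\phi_j of the empirical
% covariance operator of X (eigenvalues \hat\theta_j) and truncate at frequency m:
\[
  \hat b(t) = \sum_{j=1}^{m} \frac{\hat g_j}{\hat\theta_j}\, \hat\phi_j(t),
  \qquad
  \hat g_j = \frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i - \bar Y\bigr)
             \int_0^1 \bigl(X_i(t) - \bar X(t)\bigr)\, \hat\phi_j(t)\, dt .
\]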

597 citations


Journal ArticleDOI
TL;DR: New extensions to the previously published multivariate alteration detection (MAD) method for change detection in bi-temporal, multi- and hypervariate data such as remote sensing imagery and three regularization schemes are described.
Abstract: This paper describes new extensions to the previously published multivariate alteration detection (MAD) method for change detection in bi-temporal, multi- and hypervariate data such as remote sensing imagery. Much like boosting methods often applied in data mining work, the iteratively reweighted (IR) MAD method in a series of iterations places increasing focus on "difficult" observations, here observations whose change status over time is uncertain. The MAD method is based on the established technique of canonical correlation analysis: for the multivariate data acquired at two points in time and covering the same geographical region, we calculate the canonical variates and subtract them from each other. These orthogonal differences contain maximum information on joint change in all variables (spectral bands). The change detected in this fashion is invariant to separate linear (affine) transformations in the originally measured variables at the two points in time, such as 1) changes in gain and offset in the measuring device used to acquire the data, 2) data normalization or calibration schemes that are linear (affine) in the gray values of the original variables, or 3) orthogonal or other affine transformations, such as principal component (PC) or maximum autocorrelation factor (MAF) transformations. The IR-MAD method first calculates ordinary canonical and original MAD variates. In the following iterations we apply different weights to the observations, large weights being assigned to observations that show little change, i.e., for which the sum of squared, standardized MAD variates is small, and small weights being assigned to observations for which the sum is large. Like the original MAD method, the iterative extension is invariant to linear (affine) transformations of the original variables. To stabilize solutions to the (IR-)MAD problem, some form of regularization may be needed. This is especially useful for work on hyperspectral data. This paper describes ordinary two-set canonical correlation analysis, the MAD transformation, the iterative extension, and three regularization schemes. A simple case with real Landsat Thematic Mapper (TM) data at one point in time and (partly) constructed data at the other point in time that demonstrates the superiority of the iterative scheme over the original MAD method is shown. Also, examples with SPOT High Resolution Visible data from an agricultural region in Kenya, and hyperspectral airborne HyMap data from a small rural area in southeastern Germany are given. The latter case demonstrates the need for regularization
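
The Python sketch below is an illustrative re-implementation of the iteration loop described above (weighted canonical correlation analysis, MAD variates formed as differences of canonical variates, chi-square-based reweighting); it assumes two co-registered images flattened to pixels-by-bands arrays X and Y, omits the regularization schemes, and is not the authors' code.

import numpy as np
from scipy.linalg import eigh
from scipy.stats import chi2

def irmad(X, Y, n_iter=20):
    """X, Y: (n_pixels, n_bands) arrays for the two acquisition dates."""
    n, p = X.shape
    w = np.ones(n)                               # all pixels start with equal weight
    for _ in range(n_iter):
        Z = np.hstack([X, Y])
        mean = np.average(Z, axis=0, weights=w)
        Zc = Z - mean
        S = (Zc * w[:, None]).T @ Zc / w.sum()   # weighted covariance of the stacked data
        Sxx, Syy, Sxy = S[:p, :p], S[p:, p:], S[:p, p:]
        # canonical correlation analysis via two generalized eigenproblems
        _, A = eigh(Sxy @ np.linalg.solve(Syy, Sxy.T), Sxx)
        _, B = eigh(Sxy.T @ np.linalg.solve(Sxx, Sxy), Syy)
        U = Zc[:, :p] @ A[:, ::-1]               # canonical variates, highest correlation first
        V = Zc[:, p:] @ B[:, ::-1]
        V *= np.sign((U * V).sum(axis=0))        # align signs so paired variates correlate positively
        M = U - V                                # MAD variates
        sigma2 = np.average(M ** 2, axis=0, weights=w)
        chi2_stat = (M ** 2 / sigma2).sum(axis=1)
        w = 1.0 - chi2.cdf(chi2_stat, df=p)      # no-change pixels receive large weights
    return M, w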

595 citations


Journal ArticleDOI
TL;DR: In this article, a methodology is proposed to determine the number of primitive shocks driving a large panel of macroeconomic time series without having to estimate the dynamic factors; making precise the relationship between the dynamic and the static factors is a result of independent interest.
Abstract: A widely held but untested assumption underlying macroeconomic analysis is that the number of shocks driving economic fluctuations, q, is small. In this article we associate q with the number of dynamic factors in a large panel of data. We propose a methodology to determine q without having to estimate the dynamic factors. We first estimate a VAR in r static factors, where the factors are obtained by applying the method of principal components to a large panel of data, then compute the eigenvalues of the residual covariance or correlation matrix. We then test whether their eigenvalues satisfy an asymptotically shrinking bound that reflects sampling error. We apply the procedure to determine the number of primitive shocks in a large number of macroeconomic time series. An important aspect of the present analysis is to make precise the relationship between the dynamic factors and the static factors, which is a result of independent interest.
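
A deliberately simplified sketch of that pipeline is given below: principal-component estimates of the static factors, a least-squares VAR fitted to them, and the eigenvalues of the residual correlation matrix. The threshold used here is a placeholder stand-in rather than the paper's test statistic, and r, the lag order and c are arbitrary choices.

import numpy as np

def count_primitive_shocks(X, r=6, var_lags=1, c=1.0):
    """X: (T, N) panel of macroeconomic series; returns a crude estimate of q."""
    T, N = X.shape
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    # static factors = first r principal components of the panel
    U, s, _ = np.linalg.svd(Xz, full_matrices=False)
    F = U[:, :r] * s[:r]
    # fit a VAR(var_lags) to the factors by least squares
    Y = F[var_lags:]
    Z = np.hstack([F[var_lags - l - 1:T - l - 1] for l in range(var_lags)])
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    E = Y - Z @ B                                 # VAR residuals
    eigvals = np.linalg.eigvalsh(np.corrcoef(E.T))[::-1]
    # q = number of eigenvalue shares above a bound that shrinks with the sample size
    bound = c / min(np.sqrt(N), np.sqrt(T))       # placeholder threshold, not the paper's bound
    return int(np.sum(eigvals / eigvals.sum() > bound))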

562 citations


Journal ArticleDOI
TL;DR: The authors provide a didactic treatment of nonlinear (categorical) principal components analysis (PCA), which is the nonlinear equivalent of standard PCA and reduces the observed variables to a number of uncorrelated principal components.
Abstract: The authors provide a didactic treatment of nonlinear (categorical) principal components analysis (PCA). This method is the nonlinear equivalent of standard PCA and reduces the observed variables to a number of uncorrelated principal components. The most important advantages of nonlinear over linear PCA are that it incorporates nominal and ordinal variables and that it can handle and discover nonlinear relationships between variables. Also, nonlinear PCA can deal with variables at their appropriate measurement level; for example, it can treat Likert-type scales ordinally instead of numerically. Every observed value of a variable can be referred to as a category. While performing PCA, nonlinear PCA converts every category to a numeric value, in accordance with the variable's analysis level, using optimal quantification. The authors discuss how optimal quantification is carried out, what analysis levels are, which decisions have to be made when applying nonlinear PCA, and how the results can be interpreted. The strengths and limitations of the method are discussed. An example applying nonlinear PCA to empirical data using the program CATPCA (J. J. Meulman, W. J. Heiser, & SPSS, 2004) is provided.

498 citations


Book
02 Apr 2007
TL;DR: This book surveys chemometric methods, from experimental design and pattern recognition (including principal components analysis) to multivariate calibration and image analysis, together with their applications in spectroscopy, chromatography, biology, medicine and food science.
Abstract: Preface. 1 Introduction. 1.1 Development of Chemometrics. 1.2 Application Areas. 1.3 How to Use this Book. 1.4 Literature and Other Sources of Information. References. 2 Experimental Design. 2.1 Why Design Experiments in Chemistry? 2.2 Degrees of Freedom and Sources of Error. 2.3 Analysis of Variance and Interpretation of Errors. 2.4 Matrices, Vectors and the Pseudoinverse. 2.5 Design Matrices. 2.6 Factorial Designs. 2.7 An Example of a Factorial Design. 2.8 Fractional Factorial Designs. 2.9 Plackett-Burman and Taguchi Designs. 2.10 The Application of a Plackett-Burman Design to the Screening of Factors Influencing a Chemical Reaction. 2.11 Central Composite Designs. 2.12 Mixture Designs. 2.13 A Four Component Mixture Design Used to Study Blending of Olive Oils. 2.14 Simplex Optimization. 2.15 Leverage and Confidence in Models. 2.16 Designs for Multivariate Calibration. References. 3 Statistical Concepts. 3.1 Statistics for Chemists. 3.2 Errors. 3.3 Describing Data. 3.4 The Normal Distribution. 3.5 Is a Distribution Normal? 3.6 Hypothesis Tests. 3.7 Comparison of Means: the t-Test. 3.8 F-Test for Comparison of Variances. 3.9 Confidence in Linear Regression. 3.10 More about Confidence. 3.11 Consequences of Outliers and How to Deal with Them. 3.12 Detection of Outliers. 3.13 Shewhart Charts. 3.14 More about Control Charts. References. 4 Sequential Methods. 4.1 Sequential Data. 4.2 Correlograms. 4.3 Linear Smoothing Functions and Filters. 4.4 Fourier Transforms. 4.5 Maximum Entropy and Bayesian Methods. 4.6 Fourier Filters. 4.7 Peakshapes in Chromatography and Spectroscopy. 4.8 Derivatives in Spectroscopy and Chromatography. 4.9 Wavelets. References. 5 Pattern Recognition. 5.1 Introduction. 5.2 Principal Components Analysis. 5.3 Graphical Representation of Scores and Loadings. 5.4 Comparing Multivariate Patterns. 5.5 Preprocessing. 5.6 Unsupervised Pattern Recognition: Cluster Analysis. 5.7 Supervised Pattern Recognition. 5.8 Statistical Classification Techniques. 5.9 K Nearest Neighbour Method. 5.10 How Many Components Characterize a Dataset? 5.11 Multiway Pattern Recognition. References. 6 Calibration. 6.1 Introduction. 6.2 Univariate Calibration. 6.3 Multivariate Calibration and the Spectroscopy of Mixtures. 6.4 Multiple Linear Regression. 6.5 Principal Components Regression. 6.6 Partial Least Squares. 6.7 How Good is the Calibration and What is the Most Appropriate Model? 6.8 Multiway Calibration. References. 7 Coupled Chromatography. 7.1 Introduction. 7.2 Preparing the Data. 7.3 Chemical Composition of Sequential Data. 7.4 Univariate Purity Curves. 7.5 Similarity Based Methods. 7.6 Evolving and Window Factor Analysis. 7.7 Derivative Based Methods. 7.8 Deconvolution of Evolutionary Signals. 7.9 Noniterative Methods for Resolution. 7.10 Iterative Methods for Resolution. 8 Equilibria, Reactions and Process Analytics. 8.1 The Study of Equilibria using Spectroscopy. 8.2 Spectroscopic Monitoring of Reactions. 8.3 Kinetics and Multivariate Models for the Quantitative Study of Reactions 8.4 Developments in the Analysis of Reactions using On-line Spectroscopy. 8.5 The Process Analytical Technology Initiative. References. 9 Improving Yields and Processes Using Experimental Designs. 9.1 Introduction. 9.2 Use of Statistical Designs for Improving the Performance of Synthetic Reactions. 9.3 Screening for Factors that Influence the Performance of a Reaction. 9.4 Optimizing the Process Variables. 9.5 Handling Mixture Variables using Simplex Designs. 9.6 More about Mixture Variables. 
10 Biological and Medical Applications of Chemometrics. 10.1 Introduction. 10.2 Taxonomy. 10.3 Discrimination. 10.4 Mahalanobis Distance. 10.5 Bayesian Methods and Contingency Tables. 10.6 Support Vector Machines. 10.7 Discriminant Partial Least Squares. 10.8 Micro-organisms. 10.9 Medical Diagnosis using Spectroscopy. 10.10 Metabolomics using Coupled Chromatography and Nuclear Magnetic Resonance. References. 11 Biological Macromolecules. 11.1 Introduction. 11.2 Sequence Alignment and Scoring Matches. 11.3 Sequence Similarity. 11.4 Tree Diagrams. 11.5 Phylogenetic Trees. References. 12 Multivariate Image Analysis. 12.1 Introduction. 12.2 Scaling Images. 12.3 Filtering and Smoothing the Image. 12.4 Principal Components for the Enhancement of Images. 12.5 Regression of Images. 12.6 Alternating Least Squares as Employed in Image Analysis. 12.7 Multiway Methods In Image Analysis. References. 13 Food. 13.1 Introduction. 13.2 How to Determine the Origin of a Food Product using Chromatography. 13.3 Near Infrared Spectroscopy. 13.4 Other Information. 13.5 Sensory Analysis: Linking Composition to Properties. 13.6 Varimax Rotation. 13.7 Calibrating Sensory Descriptors to Composition. References. Index.

496 citations


Journal ArticleDOI
TL;DR: Next-day hourly ozone concentrations are predicted with a new methodology based on feedforward artificial neural networks that use principal components as inputs, which reduces model complexity and eliminates data collinearity.
Abstract: The prediction of tropospheric ozone concentrations is very important due to the negative impacts of ozone on human health, climate and vegetation. The development of models to predict ozone concentrations is thus very useful because it can provide early warnings to the population and also reduce the number of measuring sites. The aim of this study was to predict next day hourly ozone concentrations through a new methodology based on feedforward artificial neural networks using principal components as inputs. The developed model was compared with multiple linear regression, feedforward artificial neural networks based on the original data and also with principal component regression. Results showed that the use of principal components as inputs improved both models' predictions by reducing their complexity and eliminating data collinearity.
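
A minimal sketch of that modelling idea with scikit-learn is shown below: the predictors are compressed to principal components before being fed to a feed-forward neural network. The synthetic data, the number of components and the network size are placeholders, not the study's data set or architecture.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 9))                               # meteorological + pollutant predictors
y = X @ rng.normal(size=9) + 0.5 * rng.normal(size=2000)     # synthetic "next-day ozone" target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),                                     # uncorrelated inputs, less collinearity
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
)
model.fit(X_tr, y_tr)
print("R^2 on held-out data:", model.score(X_te, y_te))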

472 citations


Proceedings ArticleDOI
12 Jun 2007
TL;DR: This study identifies and evaluates four main challenges of using PCA to detect traffic anomalies: the false positive rate is very sensitive to small differences in the number of principal components in the normal subspace, the effectiveness of PCA is sensitive to the level of aggregation of the traffic measurements, a large anomaly may inadvertently pollute the normal subspace.
Abstract: Detecting anomalous traffic is a crucial part of managing IP networks. In recent years, network-wide anomaly detection based on Principal Component Analysis (PCA) has emerged as a powerful method for detecting a wide variety of anomalies. We show that tuning PCA to operate effectively in practice is difficult and requires more robust techniques than have been presented thus far. We analyze a week of network-wide traffic measurements from two IP backbones (Abilene and Geant) across three different traffic aggregations (ingress routers, OD flows, and input links), and conduct a detailed inspection of the feature time series for each suspected anomaly. Our study identifies and evaluates four main challenges of using PCA to detect traffic anomalies: (i) the false positive rate is very sensitive to small differences in the number of principal components in the normal subspace, (ii) the effectiveness of PCA is sensitive to the level of aggregation of the traffic measurements, (iii) a large anomaly may inadvertently pollute the normal subspace, (iv) correctly identifying which flow triggered the anomaly detector is an inherently challenging problem.
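
The sketch below illustrates the subspace method the paper stress-tests: the "normal" subspace is spanned by the top k principal components of the traffic matrix, and each time bin is scored by its squared residual outside that subspace. The matrix shape, the injected spike and the choice k = 4 are illustrative assumptions; the paper's first finding is precisely that results are very sensitive to this k.

import numpy as np

def pca_anomaly_scores(X, k):
    """X: time bins x links/OD-flows matrix of traffic volumes."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                          # basis of the "normal" subspace
    residual = Xc - Xc @ P @ P.T          # projection onto the residual subspace
    return (residual ** 2).sum(axis=1)    # squared prediction error per time bin

rng = np.random.default_rng(1)
X = rng.poisson(100, size=(500, 40)).astype(float)
X[250, 5] += 400                          # injected traffic spike
spe = pca_anomaly_scores(X, k=4)
print("most anomalous time bin:", int(spe.argmax()))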

412 citations


Journal ArticleDOI
TL;DR: This study compared the gait of 50 patients with end-stage knee osteoarthritis to a group of 63 age-matched asymptomatic control subjects to determine gait pattern differences between the OA and the control groups.

Journal ArticleDOI
TL;DR: Experimental results reveal that, not only does the proposed PCA-based coder yield rate-distortion and information-preservation performance superior to that of the wavelet-based coder, the best PCA performance occurs when a reduced number of PCs are retained and coded.
Abstract: Principal component analysis (PCA) is deployed in JPEG2000 to provide spectral decorrelation as well as spectral dimensionality reduction. The proposed scheme is evaluated in terms of rate-distortion performance as well as in terms of information preservation in an anomaly-detection task. Additionally, the proposed scheme is compared to the common approach of JPEG2000 coupled with a wavelet transform for spectral decorrelation. Experimental results reveal that, not only does the proposed PCA-based coder yield rate-distortion and information-preservation performance superior to that of the wavelet-based coder, the best PCA performance occurs when a reduced number of PCs are retained and coded. A linear model to estimate the optimal number of PCs to use in such dimensionality reduction is proposed
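
A rough sketch of the spectral-decorrelation step is given below (the JPEG2000 coding of the retained principal-component images is omitted): a hypothetical rows x cols x bands cube is reduced to its leading PCs and then approximately reconstructed. Function and variable names are assumptions.

import numpy as np

def spectral_pca(cube, n_keep):
    """cube: (rows, cols, bands) hyperspectral image; keep the first n_keep PCs."""
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    pcs = (X - mean) @ Vt[:n_keep].T               # retained principal-component images
    return pcs.reshape(rows, cols, n_keep), Vt[:n_keep], mean

def inverse_spectral_pca(pcs, Vt_keep, mean):
    """Approximate reconstruction of the cube from the retained PCs."""
    rows, cols, k = pcs.shape
    X = pcs.reshape(-1, k) @ Vt_keep + mean
    return X.reshape(rows, cols, Vt_keep.shape[1])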

Proceedings ArticleDOI
26 Dec 2007
TL;DR: This paper proposes a novel dimensionality reduction framework, called spectral regression (SR), for efficient regularized subspace learning, which casts the problem of learning the projective functions into a regression framework, which avoids eigen-decomposition of dense matrices.
Abstract: Subspace learning based face recognition methods have attracted considerable interest in recent years, including principal component analysis (PCA), linear discriminant analysis (LDA), locality preserving projection (LPP), neighborhood preserving embedding (NPE) and marginal Fisher analysis (MFA). However, a disadvantage of all these approaches is that their computations involve eigen-decomposition of dense matrices which is expensive in both time and memory. In this paper, we propose a novel dimensionality reduction framework, called spectral regression (SR), for efficient regularized subspace learning. SR casts the problem of learning the projective functions into a regression framework, which avoids eigen-decomposition of dense matrices. Also, with the regression based framework, different kinds of regularizers can be naturally incorporated into our algorithm which makes it more flexible. Computational analysis shows that SR has only linear-time complexity which is a huge speed up compared to the cubic-time complexity of the ordinary approaches. Experimental results on face recognition demonstrate the effectiveness and efficiency of our method.
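
The toy sketch below conveys the core idea as summarized above: obtain target embedding values (here a deliberately simple supervised choice built from class labels) and then learn the projective functions by regularized least squares instead of by eigen-decomposition of dense matrices. It is a simplified reading for illustration, not the authors' implementation.

import numpy as np
from sklearn.linear_model import Ridge

def spectral_regression(X, y_embedding, alpha=1.0):
    """X: (n_samples, n_features); y_embedding: (n_samples, n_dims) graph-embedding targets."""
    reg = Ridge(alpha=alpha).fit(X, y_embedding)
    return reg.coef_.T                           # (n_features, n_dims) projection matrix

# toy usage: a one-dimensional "embedding" built from two class labels
X = np.random.default_rng(2).normal(size=(100, 50))
labels = np.repeat([0, 1], 50)
y_emb = np.where(labels == 0, 1.0, -1.0)[:, None]
W = spectral_regression(X, y_emb)
projected = X @ W                                # samples mapped into the learned subspace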

Proceedings ArticleDOI
17 Jun 2007
TL;DR: This paper introduces a regularized subspace learning model that uses a Laplacian penalty to constrain the coefficients to be spatially smooth; existing subspace learning algorithms fit into this model and produce spatially smooth subspaces that represent images better than their original versions, as experiments on face recognition demonstrate.
Abstract: Subspace learning based face recognition methods have attracted considerable interest in recent years, including principal component analysis (PCA), linear discriminant analysis (LDA), locality preserving projection (LPP), neighborhood preserving embedding (NPE), marginal Fisher analysis (MFA) and local discriminant embedding (LDE). These methods consider an n1 x n2 image as a vector in R^(n1 x n2), and the pixels of each image are treated as independent, even though an image represented in the plane is intrinsically a matrix. The pixels spatially close to each other may be correlated. Even though we have n1 x n2 pixels per image, this spatial correlation suggests that the real number of degrees of freedom is far lower. In this paper, we introduce a regularized subspace learning model using a Laplacian penalty to constrain the coefficients to be spatially smooth. All these existing subspace learning algorithms can fit into this model and produce a spatially smooth subspace which is better for image representation than their original version. Recognition, clustering and retrieval can then be performed in the image subspace. Experimental results on face recognition demonstrate the effectiveness of our method.

01 Jan 2007
TL;DR: PLS regression is a recent technique that generalizes and combines features from principal component analysis and multiple regression, and it is becoming a tool of choice in the social sciences as a multivariate technique for nonexperimental and experimental data alike.
Abstract: PLS regression is a recent technique that generalizes and combines features from principal component analysis and multiple regression. Its goal is to predict or analyze a set of dependent variables from a set of independent variables or predictors. This prediction is achieved by extracting from the predictors a set of orthogonal factors called latent variables which have the best predictive power. PLS regression is particularly useful when we need to predict a set of dependent variables from a (very) large set of independent variables (i.e., predictors). It originated in the social sciences (specifically economy, Herman Wold 1966) but became popular first in chemometrics (i.e., computational chemistry) due in part to Herman's son Svante (Wold, 2001) and in sensory evaluation (Martens & Naes, 1989). But PLS regression is also becoming a tool of choice in the social sciences as a multivariate technique for nonexperimental and experimental data alike (e.g., neuroimaging, see Mcintosh & Lobaugh, 2004; Worsley, 1997). It was first presented
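
A small usage sketch with scikit-learn's PLSRegression (used here as a stand-in implementation) illustrates the setting described above: many collinear predictors, a few orthogonal latent variables extracted to predict a set of responses. The data are synthetic placeholders.

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 200))                         # many collinear predictors, few samples
Y = X[:, :3] @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(60, 2))

pls = PLSRegression(n_components=3).fit(X, Y)
T_scores = pls.x_scores_                               # latent variables (orthogonal factors)
Y_hat = pls.predict(X)
print("explained variance of Y:",
      1 - ((Y - Y_hat) ** 2).sum() / ((Y - Y.mean(axis=0)) ** 2).sum())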

Proceedings ArticleDOI
29 Sep 2007
TL;DR: In this paper, the authors proposed a method for dimensionality reduction of a feature set by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA.
Abstract: Dimensionality reduction of a feature set is a common preprocessing step used for pattern recognition and classification applications. Principal Component Analysis (PCA) is one of the popular methods used, and can be shown to be optimal using different optimality criteria. However, it has the disadvantage that measurements from all the original features are used in the projection to the lower dimensional space. This paper proposes a novel method for dimensionality reduction of a feature set by choosing a subset of the original features that contains most of the essential information, using the same criteria as PCA. We call this method Principal Feature Analysis (PFA). The proposed method is successfully applied for choosing the principal features in face tracking and content-based image retrieval (CBIR) problems. Automated annotation of digital pictures has been a highly challenging problem for computer scientists since the invention of computers. The capability of annotating pictures by computers can lead to breakthroughs in a wide range of applications including Web image search, online picture-sharing communities, and scientific experiments. In our work, by advancing statistical modeling and optimization techniques, we can train computers about hundreds of semantic concepts using example pictures from each concept. The ALIPR (Automatic Linguistic Indexing of Pictures - Real Time) system of fully automatic and high speed annotation for online pictures has been constructed. Thousands of pictures from an Internet photo-sharing site, unrelated to the source of those pictures used in the training process, have been tested. The experimental results show that a single computer processor can suggest annotation terms in real-time and with good accuracy.
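
The sketch below is a simplified re-implementation, for illustration only, of the feature-selection idea in the first part of this abstract: cluster the rows of the retained PCA loading matrix and keep, from each cluster, the original feature closest to the cluster centre. The numbers of components and selected features are arbitrary.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def principal_feature_analysis(X, n_components, n_features):
    """Select n_features original features using the PCA loading structure."""
    pca = PCA(n_components=n_components).fit(X)
    A = pca.components_.T                      # rows = original features in PC space
    km = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit(A)
    selected = []
    for c in range(n_features):
        rows = np.where(km.labels_ == c)[0]
        centre = km.cluster_centers_[c]
        selected.append(rows[np.argmin(np.linalg.norm(A[rows] - centre, axis=1))])
    return sorted(selected)

X = np.random.default_rng(4).normal(size=(300, 20))
print("selected feature indices:", principal_feature_analysis(X, n_components=5, n_features=8))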

Journal ArticleDOI
TL;DR: Several ECG applications are reviewed where PCA techniques have been successfully employed, including data compression, ST-T segment analysis for the detection of myocardial ischemia and abnormalities in ventricular repolarization, extraction of atrial fibrillatory waves for detailed characterization of atrial fibrillation, and analysis of body surface potential maps.
Abstract: This paper reviews the current status of principal component analysis in the area of ECG signal processing. The fundamentals of PCA are briefly described and the relationship between PCA and the Karhunen-Loeve transform is explained. Aspects of PCA related to data with temporal and spatial correlations are considered, as is adaptive estimation of principal components. Several ECG applications are reviewed where PCA techniques have been successfully employed, including data compression, ST-T segment analysis for the detection of myocardial ischemia and abnormalities in ventricular repolarization, extraction of atrial fibrillatory waves for detailed characterization of atrial fibrillation, and analysis of body surface potential maps.

Journal ArticleDOI
TL;DR: Three classification techniques (loading and score projections based on principal components analysis, cluster analysis and self-organizing maps) were applied to a large environmental data set of chemical indicators of river water quality and revealed different patterns of monitoring sites conditionally named "tributary", "urban", "rural" or "background".

Journal ArticleDOI
TL;DR: This paper proposes a method, named orthogonal neighborhood preserving projections, which works by first building an "affinity" graph for the data in a way that is similar to the method of locally linear embedding (LLE); in contrast with the standard LLE, ONPP employs an explicit linear mapping between the input and the reduced spaces.
Abstract: This paper considers the problem of dimensionality reduction by orthogonal projection techniques. The main feature of the proposed techniques is that they attempt to preserve both the intrinsic neighborhood geometry of the data samples and the global geometry. In particular, we propose a method, named orthogonal neighborhood preserving projections, which works by first building an "affinity" graph for the data in a way that is similar to the method of locally linear embedding (LLE). However, in contrast with the standard LLE where the mapping between the input and the reduced spaces is implicit, ONPP employs an explicit linear mapping between the two. As a result, handling new data samples becomes straightforward, as this amounts to a simple linear transformation. We show how we can define kernel variants of ONPP, as well as how we can apply the method in a supervised setting. Numerical experiments are reported to illustrate the performance of ONPP and to compare it with a few competing methods.

Journal ArticleDOI
TL;DR: In this article, two versions of functional Principal Component Regression (PCR) are developed, both using B-splines and roughness penalties, and the regularized-components version applies such a penalty to the construction of the principal components.
Abstract: Regression of a scalar response on signal predictors, such as near-infrared (NIR) spectra of chemical samples, presents a major challenge when, as is typically the case, the dimension of the signals far exceeds their number. Most solutions to this problem reduce the dimension of the predictors either by regressing on components [e.g., principal component regression (PCR) and partial least squares (PLS)] or by smoothing methods, which restrict the coefficient function to the span of a spline basis. This article introduces functional versions of PCR and PLS, which combine both of the foregoing dimension-reduction approaches. Two versions of functional PCR are developed, both using B-splines and roughness penalties. The regularized-components version applies such a penalty to the construction of the principal components (i.e., it uses functional principal components), whereas the regularized-regression version incorporates a penalty in the regression. For the latter form of functional PCR, the penalty parame...

Journal ArticleDOI
TL;DR: The case studies reveal the effectiveness of the systematic framework in deriving data-driven soft sensors that provide reasonably reliable one-step-ahead predictions.

Journal ArticleDOI
TL;DR: A novel and uniform framework for both face identification and verification is presented, based on a combination of Gabor wavelets and General Discriminant Analysis, and can be considered appearance based in that features are extracted from the whole face image and subjected to subspace projection.

Journal ArticleDOI
TL;DR: Results show that trace elements of V, Cr, As, Mo, W, and U with greatest positive loadings typically occur as soluble oxyanions in oxidizing waters, while Mn and Co with greatest negative loadings are generally more soluble within oxygen depleted groundwater.

Journal ArticleDOI
TL;DR: A new monitoring method based on independent component analysis−principal component analysis (ICA−PCA) is proposed, where the Gaussian and non-Gaussian information can be extracted for fault detection and diagnosis and a new mixed similarity factor is proposed.
Abstract: Many of the current multivariate statistical process monitoring techniques (such as principal component analysis (PCA) or partial least squares (PLS)) do not utilize the non-Gaussian information of...

Journal ArticleDOI
TL;DR: Shannon entropy is used to provide an estimate of the number of interpretable components in a principal component analysis and several ad hoc stopping rules for dimension determination are reviewed and a modification of the broken stick model is presented.
Abstract: Shannon entropy is used to provide an estimate of the number of interpretable components in a principal component analysis. In addition, several ad hoc stopping rules for dimension determination are reviewed and a modification of the broken stick model is presented. The modification incorporates a test for the presence of an "effective degeneracy" among the subspaces spanned by the eigenvectors of the correlation matrix of the data set and then allocates the total variance among subspaces. A summary of the performance of the methods applied to both published microarray data sets and to simulated data is given.
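
A minimal sketch of the two ingredients mentioned above follows: the Shannon entropy of the normalized eigenvalues of the correlation matrix (whose exponential behaves as an "effective number" of components) and the classical broken-stick expectations. The details are illustrative and do not reproduce the paper's modified test.

import numpy as np

def entropy_dimension(X):
    """Effective number of components from the entropy of the normalized eigenvalues."""
    lam = np.linalg.eigvalsh(np.corrcoef(X.T))[::-1]
    p = lam / lam.sum()
    H = -(p * np.log(p)).sum()
    return float(np.exp(H)), lam

def broken_stick(n):
    """Expected eigenvalue fractions under the broken-stick null model."""
    return np.array([np.sum(1.0 / np.arange(k, n + 1)) / n for k in range(1, n + 1)])

X = np.random.default_rng(5).normal(size=(200, 10))
eff_dim, lam = entropy_dimension(X)
keep = int(np.sum(lam / lam.sum() > broken_stick(len(lam))))
print(f"effective dimension ~ {eff_dim:.2f}; broken-stick rule keeps {keep} components")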

BookDOI
24 Oct 2007
TL;DR: The book starts with the quote of the classical Pearson definition of PCA and includes reviews of various methods: NLPCA, ICA, MDS, embedding and clustering algorithms, principal manifolds and SOM.
Abstract: In 1901, Karl Pearson invented Principal Component Analysis (PCA). Since then, PCA serves as a prototype for many other tools of data analysis, visualization and dimension reduction: Independent Component Analysis (ICA), Multidimensional Scaling (MDS), Nonlinear PCA (NLPCA), Self Organizing Maps (SOM), etc. The book starts with the quote of the classical Pearson definition of PCA and includes reviews of various methods: NLPCA, ICA, MDS, embedding and clustering algorithms, principal manifolds and SOM. New approaches to NLPCA, principal manifolds, branching principal components and topology preserving mappings are described as well. Presentation of algorithms is supplemented by case studies, from engineering to astronomy, but mostly of biological data: analysis of microarray and metabolite data. The volume ends with a tutorial "PCA and K-means decipher genome". The book is meant to be useful for practitioners in applied data analysis in life sciences, engineering, physics and chemistry; it will also be valuable to PhD students and researchers in computer sciences, applied mathematics and statistics.

Journal ArticleDOI
TL;DR: In this article, six popular approaches to NIR spectrum-property calibration model building are compared on the basis of gasoline spectral data, and the best preprocessing technique is found for each method.

Proceedings ArticleDOI
28 Oct 2007
TL;DR: This paper proposes a novel dimensionality reduction framework, called Unified Sparse Subspace Learning (USSL), for learning sparse projections, which casts the problem of learning the projective functions into a regression framework and thereby facilitates the use of different kinds of regularizers.
Abstract: Recently the problem of dimensionality reduction (or, subspace learning) has received a lot of interest in many fields of information processing, including data mining, information retrieval, and pattern recognition. Some popular methods include principal component analysis (PCA), linear discriminant analysis (LDA) and locality preserving projection (LPP). However, a disadvantage of all these approaches is that the learned projective functions are linear combinations of all the original features, thus it is often difficult to interpret the results. In this paper, we propose a novel dimensionality reduction framework, called Unified Sparse Subspace Learning (USSL), for learning sparse projections. USSL casts the problem of learning the projective functions into a regression framework, which facilitates the use of different kinds of regularizers. By using an L1-norm regularizer (lasso), the sparse projections can be efficiently computed. Experimental results on real world classification and clustering problems demonstrate the effectiveness of our method.

Journal ArticleDOI
TL;DR: In this paper, a robust projection-pursuit-based method for principal component analysis (PCA) is proposed for the analysis of chemical data, where the number of variables is typically large.

Journal ArticleDOI
TL;DR: In this article, the results of an earlier statistical comparison of rotational schemes applied to monthly Pennsylvanian precipitation data (1958-1978) are used to analyse differences among the various solutions.
Abstract: Climate regionalization studies have made intensive use of eigenvector analysis in recent literature. This analysis provides a motivation for examination of the efficacy and validity of variations of principal component analysis (PCA) for such tasks as an eigenvector-based regionalization. Specifically, this study applies the results of an earlier statistical comparison of rotational schemes to monthly Pennsylvanian precipitation data (1958–1978) to analyse differences among the various solutions. Unrotated, orthogonally rotated, and obliquely rotated solutions (eight in total) are compared in order to assess the model and locational consistency among and within these solutions. Model correspondence and consistency are measured by a congruence coefficient used to match (i) the principal components (PC) of the total domain for the selected benchmark pattern with PCs from the total domains of the remaining seven solutions, and (ii) each PC of the total domain with PCs of 25 randomly selected subdomain pairs (a set of 10 and 11 years of data). Locational or geographical consistency among the PC patterns is determined by quantifying the changes in area and area boundary defined by a threshold loading. The results from the Pennsylvanian data indicated that substantial differences in regionalization arose solely from the choice of a particular rotation algorithm, or lack thereof. Oblique rotations were generally found to be the most stable, whereas the orthogonally rotated and unrotated solutions were less stable. The quantitative areal differences among rotation schemes and the unrotated solution illustrate the inherent danger in blindly applying any given solution if physical interpretation of the regionalization is important. The quantitative areal and boundary differences may particularly influence PCA over global spatial domains.
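
For concreteness, the sketch below implements two ingredients of such a comparison: an orthogonal varimax rotation of a PCA loading matrix and Tucker's congruence coefficient used to match component patterns between solutions. The data, the number of components and the loading convention are placeholder assumptions, and oblique rotations are not covered.

import numpy as np
from sklearn.decomposition import PCA

def varimax(L, n_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a loading matrix L (variables x components)."""
    p, k = L.shape
    R = np.eye(k)
    var_old = 0.0
    for _ in range(n_iter):
        LR = L @ R
        U, s, Vt = np.linalg.svd(L.T @ (LR ** 3 - LR @ np.diag((LR ** 2).sum(axis=0)) / p))
        R = U @ Vt
        if s.sum() < var_old * (1 + tol):        # stop when the criterion no longer improves
            break
        var_old = s.sum()
    return L @ R

def congruence(a, b):
    """Tucker's congruence coefficient between two loading (pattern) vectors."""
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# toy comparison of unrotated vs. varimax-rotated loadings
X = np.random.default_rng(6).normal(size=(252, 30))      # e.g. months x stations, placeholder
pca = PCA(n_components=4).fit(X)
L = pca.components_.T * np.sqrt(pca.explained_variance_) # conventional loading scaling
L_rot = varimax(L)
print("congruence of first patterns:", congruence(L[:, 0], L_rot[:, 0]))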