Author

Johan A. Westerhuis

Bio: Johan A. Westerhuis is an academic researcher at the University of Amsterdam. His research focuses on batch processing and principal component analysis. He has an h-index of 42 and has co-authored 150 publications receiving 9,780 citations. Previous affiliations include Radboud University Nijmegen and North-West University.


Papers
Journal ArticleDOI
TL;DR: Range scaling and autoscaling were able to remove the dependence of the metabolite rank on the average concentration and the magnitude of the fold changes, and showed biologically sensible results after principal component analysis (PCA).
Abstract: Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, 5000-fold differences in concentration between metabolites are present in a metabolomics data set, while these differences are not proportional to the biological relevance of the metabolites. Data analysis methods, however, are not able to make this distinction. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data and thus improving its interpretability. Different data pretreatment methods, i.e. centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the rank of the metabolites that are most important from a biological point of view. Furthermore, the stability of the rank, the influence of technical errors on data analysis, and the tendency of data analysis methods to select highly abundant metabolites were all affected by the data pretreatment method used prior to data analysis. Different pretreatment methods emphasize different aspects of the data, and each has its own merits and drawbacks. The choice of pretreatment method depends on the biological question to be answered, the properties of the data set, and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods. That is, range scaling and autoscaling were able to remove the dependence of the metabolite rank on the average concentration and the magnitude of the fold changes, and showed biologically sensible results after principal component analysis (PCA). In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects which metabolites are identified as most important.
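The pretreatment methods compared in the abstract are simple column-wise operations on the data matrix. The sketch below (my own minimal version, with invented variable names and toy data) shows four of them; the formulas follow the standard definitions, applied per metabolite:

```python
import numpy as np

# Toy, metabolomics-like data: samples in rows, metabolites in columns,
# with strongly skewed concentrations (lognormal).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(10, 4))

def centering(X):
    # remove the per-metabolite mean; magnitudes still dominate
    return X - X.mean(axis=0)

def autoscaling(X):
    # unit variance per metabolite: all metabolites weigh equally
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pareto_scaling(X):
    # divide by sqrt(std): a compromise between centering and autoscaling
    return (X - X.mean(axis=0)) / np.sqrt(X.std(axis=0, ddof=1))

def range_scaling(X):
    # scale by the observed biological range of each metabolite
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# After autoscaling no single high-abundance metabolite can dominate a
# subsequent PCA, which is the effect on metabolite ranking discussed above.
print(autoscaling(X).std(axis=0, ddof=1))
```

After autoscaling every column has unit standard deviation, and after range scaling every column spans exactly one unit, which is why both remove the dependence of the metabolite rank on average concentration.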

1,987 citations

Journal ArticleDOI
TL;DR: A strategy based on cross model validation and permutation testing to validate the classification models is discussed, and the use of PLS-DA score plots for inference of class differences is advised against.
Abstract: Classifying groups of individuals based on their metabolic profile is one of the main topics in metabolomics research. Due to the low number of individuals compared to the large number of variables, this is not an easy task. PLS-DA is one of the data analysis methods used for such classification. Unfortunately, this method eagerly overfits the data, so rigorous validation is necessary. The validation, however, is far from straightforward. In this paper we discuss a strategy based on cross model validation and permutation testing to validate the classification models. It is also shown that overly optimistic results are obtained when the validation is not done properly. Furthermore, we advocate against the use of PLS-DA score plots for inference of class differences.
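The permutation-testing idea can be sketched as follows. This is a simplified stand-in, not the paper's full cross model validation scheme: the one-component PLS-DA, the data, and all names are my own, and the point is only to show how an empirical p-value is obtained by refitting on shuffled class labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 200                        # few samples, many variables
X = rng.normal(size=(n, p))
y = np.repeat([0.0, 1.0], n // 2)
X[y == 1, :10] += 1.5                 # a real class difference in 10 variables

def fit_plsda1(X, y):
    # one PLS component: weight vector proportional to X_c' y_c
    xm, ym = X.mean(0), y.mean()
    w = (X - xm).T @ (y - ym)
    w /= np.linalg.norm(w)
    t = (X - xm) @ w
    b = (t @ (y - ym)) / (t @ t)      # regress y on the score
    return xm, ym, w, b

def predict_plsda1(model, X):
    xm, ym, w, b = model
    return ym + ((X - xm) @ w) * b

def cv_accuracy(X, y, k=5):
    # k-fold cross-validated classification accuracy (threshold at 0.5)
    folds = np.array_split(rng.permutation(len(y)), k)
    hits = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        model = fit_plsda1(X[train], y[train])
        hits += np.sum((predict_plsda1(model, X[test]) > 0.5) == (y[test] > 0.5))
    return hits / len(y)

true_acc = cv_accuracy(X, y)
# Refit many times with randomly permuted labels: under the null hypothesis
# of no class difference, the true accuracy should not stand out.
perm_acc = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(50)])
p_value = (np.sum(perm_acc >= true_acc) + 1) / (len(perm_acc) + 1)
print(true_acc, p_value)
```

Note that if in-sample accuracy were used instead of cross-validated accuracy, the permuted models would also score near 100% here (40 samples, 200 variables), which is exactly the overfitting problem the paper warns about.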

1,216 citations

Journal ArticleDOI
TL;DR: It is recommended that in cases where the variables can be separated into meaningful blocks, the standard PCA and PLS methods be used to build the models and then the weights and loadings of the individual blocks and super block and the percentage variation explained in each block be calculated from the results.
Abstract: Multiblock and hierarchical PCA and PLS methods have been proposed in the recent literature in order to improve the interpretability of multivariate models. They have been used in cases where the number of variables is large and additional information is available for blocking the variables into conceptually meaningful blocks. In this paper we compare these methods from a theoretical or algorithmic viewpoint using a common notation and illustrate their differences with several case studies. Undesirable properties of some of these methods, such as convergence problems or loss of data information due to deflation procedures, are pointed out and corrected where possible. It is shown that the objective function of the hierarchical PCA and hierarchical PLS methods is not clear and the corresponding algorithms may converge to different solutions depending on the initial guess of the super score. It is also shown that the results of consensus PCA (CPCA) and multiblock PLS (MBPLS) can be calculated from the standard PCA and PLS methods when the same variable scalings are applied for these methods. The standard PCA and PLS methods require less computation and give better estimation of the scores in the case of missing data. It is therefore recommended that in cases where the variables can be separated into meaningful blocks, the standard PCA and PLS methods be used to build the models and then the weights and loadings of the individual blocks and super block and the percentage variation explained in each block be calculated from the results. © 1998 John Wiley & Sons, Ltd.
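The paper's recommendation, deriving block-level quantities from one standard PCA on the concatenated data, can be sketched roughly as below. The block-scaling choice (autoscale, then divide each block by the square root of its variable count) and all names are my own assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(size=(20, 6))   # block 1 (e.g. process variables)
X2 = rng.normal(size=(20, 3))   # block 2 (e.g. quality variables)

def block_scale(X):
    # autoscale, then shrink by sqrt(#variables) so large blocks
    # do not dominate the concatenated model
    Xs = (X - X.mean(0)) / X.std(0, ddof=1)
    return Xs / np.sqrt(X.shape[1])

B1, B2 = block_scale(X1), block_scale(X2)
X = np.hstack([B1, B2])

# One standard PCA (via SVD) on the concatenated data gives the
# super scores T and the full loading matrix P.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
T, P = U[:, :k] * s[:k], Vt[:k].T

# Block loadings are simply the corresponding rows of P; per-block
# explained variation follows from the reconstruction restricted to
# that block's columns.
P1, P2 = P[:6], P[6:]
def explained(Xb, T, Pb):
    return 1 - np.sum((Xb - T @ Pb.T) ** 2) / np.sum(Xb ** 2)

print(explained(B1, T, P1), explained(B2, T, P2))
```

Because the truncated reconstruction is an orthogonal projection, each block's explained variation lies between 0 and 1, and no separate multiblock algorithm with its own convergence behaviour is needed.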

682 citations

Journal ArticleDOI
TL;DR: Properties of four diagnostic statistics of PLS-DA, namely the number of misclassifications (NMC), the area under the receiver operating characteristic curve (AUROC), Q2, and discriminant Q2 (DQ2), are discussed; NMC and AUROC seem more efficient and more reliable diagnostic statistics and are recommended for two-group discrimination metabolomics studies.
Abstract: Partial Least Squares-Discriminant Analysis (PLS-DA) is a PLS regression method with a special binary 'dummy' y-variable, commonly used for classification purposes and biomarker selection in metabolomics studies. Several statistical approaches are currently in use to validate outcomes of PLS-DA analyses, e.g. double cross validation procedures or permutation testing. However, there is great inconsistency in the optimization and the assessment of performance of PLS-DA models due to the many different diagnostic statistics currently employed in metabolomics data analyses. In this paper, properties of four diagnostic statistics of PLS-DA, namely the number of misclassifications (NMC), the Area Under the Receiver Operating Characteristic (AUROC), Q2 and Discriminant Q2 (DQ2), are discussed. All four diagnostic statistics are used in the optimization and the performance assessment of PLS-DA models of three different-sized metabolomics data sets obtained with two different types of analytical platforms and with different levels of known differences between two groups: a control and a case group. Statistical significance of the obtained PLS-DA models was evaluated with permutation testing. PLS-DA models obtained with NMC and AUROC are more powerful in detecting very small differences between groups than models obtained with Q2 and DQ2. Reproducibility of the obtained PLS-DA model outcomes, model complexity, and permutation test distributions are also investigated to explain this phenomenon. DQ2 and Q2 (in contrast to NMC and AUROC) prefer PLS-DA models with lower complexity and require a higher number of permutation tests and submodels to accurately estimate the statistical significance of the model performance. NMC and AUROC seem more efficient and more reliable diagnostic statistics and should be recommended for two-group discrimination metabolomics studies.
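The four diagnostics can be illustrated on toy predictions. The DQ2 rule follows the published idea of discriminant Q2 (residuals are not counted when a prediction lies beyond its class label on the correct side), but the data, thresholds, and function names here are my own:

```python
import numpy as np

# Two-class labels (0/1) and continuous PLS-DA predictions (invented).
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y_hat = np.array([0.1, -0.2, 0.4, 0.6, 0.9, 1.3, 0.7, 0.45])

# NMC: number of misclassifications at the usual 0.5 threshold.
nmc = np.sum((y_hat > 0.5) != (y > 0.5))

def auroc(y, s):
    # probability that a random case scores above a random control
    pos, neg = s[y == 1], s[y == 0]
    return np.mean([p > n for p in pos for n in neg])

def q2(y, y_hat):
    # ordinary predictive residual sum of squares, relative to the mean
    press = np.sum((y - y_hat) ** 2)
    return 1 - press / np.sum((y - y.mean()) ** 2)

def dq2(y, y_hat):
    # discriminant Q2: predictions beyond the class label on the
    # correct side incur no penalty
    r = y - y_hat
    r[(y == 0) & (y_hat < 0)] = 0
    r[(y == 1) & (y_hat > 1)] = 0
    return 1 - np.sum(r ** 2) / np.sum((y - y.mean()) ** 2)

print(nmc, auroc(y, y_hat), q2(y, y_hat), dq2(y, y_hat))
```

On these toy numbers DQ2 exceeds Q2, since the "too good" predictions (-0.2 for a control, 1.3 for a case) are not penalised, while NMC and AUROC only look at the ordering and the threshold.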

602 citations

Journal ArticleDOI
TL;DR: Control limits for both types of contributions are introduced to show the relative importance of a contribution compared to the contributions of the corresponding process variables in the batches obtained under normal operating conditions.
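A rough sketch of the underlying idea follows. The paper treats contributions to two monitoring statistics; here only per-variable contributions to the squared prediction error (SPE) of a PCA model are shown, and the model size, limit choice (an empirical percentile over normal-operating data), and fault scenario are all my own assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 8))        # 50 normal-operating observations, 8 variables
Xc = X - X.mean(0)

# 3-component PCA model of normal operation.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:3].T

# Residuals under the model; each squared entry is one variable's
# contribution to the SPE of that observation.
E = Xc - Xc @ P @ P.T
contrib = E ** 2

# Empirical 95th-percentile control limit per variable, computed from
# the normal-operating contributions.
limit = np.percentile(contrib, 95, axis=0)

# A new observation with a simulated fault in variable index 2:
x_new = Xc[0] + np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
e_new = x_new - P @ (P.T @ x_new)
flags = e_new ** 2 > limit
print(np.where(flags)[0])
```

The point of such limits is exactly the one in the TL;DR: a contribution is only meaningful relative to how large that variable's contributions normally are, not in absolute terms.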

488 citations


Cited by
Book ChapterDOI
01 Jan 2010

5,842 citations

Book
01 Oct 2006
TL;DR: An encyclopedia volume covering pharmaceutical technology topics ranging from blow-fill-seal aseptic processing to super disintegrant characterization and function.
Abstract: Contents of Volume 19:
- Contributors to Volume 19
- Contents of Other Volumes
- Blow-Fill-Seal Aseptic Processing (Deborah J. Jones)
- Compounding, Contemporary (Loyd V. Allen, Jr.)
- Drug Delivery-Oral Colon Specific (Vincent H. L. Lee and Suman K. Mukherjee)
- The European Agency for the Evaluation of Medicinal Products (EMEA) (David Jacobs)
- Harmonization-Compendia (Lee T. Grady and Jerome A. Halperin)
- Inhalation, Dry Powder (Lynn Van Campen and Geraldine Venthoye)
- Liquid Crystals in Drug Delivery (Christel C. Mueller-Goymann)
- Medication Errors: A New Challenge for the Pharmaceutical Industry (Diane R. Cousins)
- Mucoadhesive Hydrogels in Drug Delivery (Hans E. Junginger, Maya Thanou, and J. Coos Verhoef)
- Peptides and Proteins: Buccal Absorption (Hemant H. Alur, Thomas P. Johnston, and Ashim K. Mitra)
- Pharmaceutical Quality Assurance Microbiology Laboratories (Anthony M. Cundell)
- Process Chemistry in the Pharmaceutical Industry (Kumar G. Gadamasetti and Ambarish K. Singh)
- Radiolabeling of Pharmaceutical Aerosols and Gamma Scintigraphic Imaging for Lung Deposition (Hak-Kim Chan)
- Super Disintegrants: Characterization and Function (Larry L. Augsburger, Albert W. Brzeczko, Umang Shah, and Huijeong A. Hahm)
- Index to Volume 20

2,683 citations

Journal ArticleDOI
TL;DR: mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation, is introduced; it extends Projection to Latent Structure models for discriminant analysis.
Abstract: The advent of high throughput technologies has led to a wealth of publicly available 'omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a 'molecular signature') to explain or predict biological conditions, but mainly for a single type of 'omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous 'omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple 'omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of 'omics data available from the package.

1,862 citations

Journal ArticleDOI
TL;DR: The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.
Abstract: Principal component analysis is one of the most important and powerful methods in chemometrics as well as in a wealth of other areas. This paper provides a description of how to understand, use, and interpret principal component analysis. The paper focuses on the use of principal component analysis in typical chemometric areas but the results are generally applicable.
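A minimal numeric illustration of PCA in the scores-and-loadings notation common in chemometrics (X ≈ T P'), computed here with an SVD on column-centred data; the toy data and names are my own:

```python
import numpy as np

# Toy data with correlated columns (mixing random factors).
rng = np.random.default_rng(4)
X = rng.normal(size=(15, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)             # PCA models the centred data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                           # scores: coordinates of the samples
P = Vt.T                            # loadings: orthonormal variable directions
explained = s ** 2 / np.sum(s ** 2) # share of variance per component

# With all components the model reconstructs Xc exactly, and the
# explained-variance shares are non-increasing, which is what makes
# truncating to the first few components meaningful.
print(explained)
```

Keeping only the first k columns of T and P gives the usual low-rank PCA model, and the tail of `explained` tells you how much information the truncation discards.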

1,622 citations