Journal ArticleDOI

Classification tools in chemistry. Part 1: linear models. PLS-DA

26 Jul 2013-Analytical Methods (The Royal Society of Chemistry)-Vol. 5, Iss: 16, pp 3790-3798
TL;DR: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial, and issues to be evaluated during model training and validation are introduced and explained using a chemical dataset.
Abstract: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial. All issues to be evaluated during model training and validation are introduced and explained using a chemical dataset, composed of toxic and non-toxic sediment samples. The analysis was carried out with MATLAB routines, which are available in the ESI of this tutorial, together with the dataset and a detailed list of all MATLAB instructions used for the analysis.
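The tutorial's routines are MATLAB code supplied in the ESI; as a rough orientation, the core PLS-DA idea (fit a PLS1 regression to 0/1 class labels, then threshold the prediction) can be sketched in Python. Everything below is an illustrative assumption, not the tutorial's own code: the bare-bones NIPALS loop, the 0.5 decision threshold, and the synthetic two-class data standing in for the toxic/non-toxic sediment samples.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Bare-bones NIPALS PLS1 on mean-centred data; returns the
    regression coefficient vector plus the centring terms."""
    Xc, yc = X - X.mean(0), y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc                 # weight vector from X-y covariance
        w /= np.linalg.norm(w)
        t = Xc @ w                    # scores
        tt = t @ t
        p = Xc.T @ t / tt             # X loadings
        q = yc @ t / tt               # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - q * t               # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    B = W @ np.linalg.solve(P.T @ W, Q)   # coefficients in original X space
    return B, X.mean(0), y.mean()

def pls_da_predict(X, B, x_mean, y_mean):
    # Threshold the continuous PLS prediction at 0.5 (classes coded 0/1).
    return ((X - x_mean) @ B + y_mean > 0.5).astype(int)

# Synthetic two-class data: two shifted Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(2, 1, (30, 8))])
y = np.array([0] * 30 + [1] * 30)
B, xm, ym = pls1_fit(X, y, n_comp=2)
acc = (pls_da_predict(X, B, xm, ym) == y).mean()
```

Note this reports accuracy on the training samples only; as the tutorial stresses, a real model must be validated on samples held out of calibration.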
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors review applications of NIR spectroscopy in the food processing industry, focusing on studies dealing with the on-line application of NIR spectroscopy to industrial processes in the food industry, categorised according to their application conditions into semi-industrial scale and industrial scale.
Abstract: Near infrared (NIR) spectroscopy is an emerging analytical technique enjoying increasing popularity in the food processing industry due to its low running costs and the fact that it requires no sample preparation. Moreover, it is a non-destructive, environmentally friendly, rapid technique capable of on-line application. This technique is therefore well suited for implementation as an analytical tool in industrial processing. The different fields of application of NIR spectroscopy reported in the present review highlight its enormous versatility. Quantitative analyses of chemical constituents using this methodology are widespread. Moreover, a wide range of qualitative determinations have been reported, e.g. for authenticity control, sample discrimination, and the assessment of sensory, rheological, technological and physical attributes. Both animal- and plant-derived foodstuffs have been evaluated in this context. Highly diverse matrices such as intact solid samples, free-flowing solids, and pasty and fluid samples can be analysed by NIR spectroscopy. Sophisticated conditions for application at industrial scale comprise, among others, measurements on moving conveyor belts, in continuous flows in tubes, and the monitoring of fermentation processes. For such purposes, different designs of NIR spectrometers have been developed: hyperspectral imaging systems, portable devices, fibre-optical and direct-contact probes, tube-integrated probes measuring through windows, and automated sample cell loading. In the present review, emphasis is put on studies dealing with the on-line application of NIR spectroscopy to industrial processes in the food industry, which are categorised according to their application conditions into semi-industrial scale and industrial scale.

394 citations

Journal ArticleDOI
01 Jan 2018
TL;DR: A comparative study of various reported data splitting methods found that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimation of model performance.
Abstract: Model validation is the most important part of building a supervised model. For building a model with good generalization performance one must have a sensible data splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of mis-classification and variable sample sizes. Then partial least squares for discriminant analysis and support vector machines for classification were applied to these datasets. Data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and sample set partitioning based on joint X–Y distances (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models then moved towards the approximation described by the central limit theorem for the simulated datasets used.
We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather unrepresentative sample set for model performance estimation.

380 citations
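The validation-set estimates compared in the study above all start from some split of the data; as a minimal concrete example, here is a k-fold cross-validation sketch in Python. The toy nearest-centroid classifier, the simulated two-class data and all names are our own illustrative assumptions (the paper itself used MixSim-generated data with PLS-DA and SVM classifiers).

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle sample indices and cut them into k roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def nearest_mean_acc(X_tr, y_tr, X_va, y_va):
    # Tiny classifier: assign each sample to the nearer class centroid.
    m0, m1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(X_va - m1, axis=1)
            < np.linalg.norm(X_va - m0, axis=1)).astype(int)
    return (pred == y_va).mean()

# Simulated two-class data: two shifted Gaussian clouds.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1.5, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

folds = kfold_indices(len(y), 5, rng)
accs = []
for i, va in enumerate(folds):
    tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
    accs.append(nearest_mean_acc(X[tr], y[tr], X[va], y[va]))
cv_estimate = np.mean(accs)
```

The paper's point is that `cv_estimate` can differ appreciably from accuracy on a truly blind test set, especially when the dataset is small.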

Journal ArticleDOI
23 Jul 2018-Analyst
TL;DR: The aim of the article is to review, outline and describe the contemporary PLS-DA modelling practice strategies, and to critically discuss the respective knowledge gaps that have emerged in response to the present big data era.
Abstract: Partial least squares-discriminant analysis (PLS-DA) is a versatile algorithm that can be used for predictive and descriptive modelling as well as for discriminative variable selection. However, versatility is both a blessing and a curse and the user needs to optimize a wealth of parameters before reaching reliable and valid outcomes. Over the past two decades, PLS-DA has demonstrated great success in modelling high-dimensional datasets for diverse purposes, e.g. product authentication in food analysis, diseases classification in medical diagnosis, and evidence analysis in forensic science. Despite that, in practice, many users have yet to grasp the essence of constructing a valid and reliable PLS-DA model. As the technology progresses, across every discipline, datasets are evolving into a more complex form, i.e. multi-class, imbalanced and colossal. Indeed, the community is welcoming a new era called big data. In this context, the aim of the article is two-fold: (a) to review, outline and describe the contemporary PLS-DA modelling practice strategies, and (b) to critically discuss the respective knowledge gaps that have emerged in response to the present big data era. This work could complement other available reviews or tutorials on PLS-DA, to provide a timely and user-friendly guide to researchers, especially those working in applied research.

357 citations

Journal ArticleDOI
TL;DR: This review describes and compares the most widely used multivariate statistical techniques including exploratory, interpretive and discriminatory procedures, and presents examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world.
Abstract: Recent advances in high-throughput methods of molecular analysis have led to an explosion of studies generating large-scale ecological data sets. Particularly notable progress has been made in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure.

314 citations


Additional excerpts

  • "...described in Ballabio & Consonni (2013)..."


Journal ArticleDOI
TL;DR: In this study, different global measures of classification performances are compared by means of results achieved on an extended set of real multivariate datasets and a set of benchmark values based on different random classification scenarios are introduced.

173 citations

References
Journal ArticleDOI
TL;DR: It is argued that a high value of LOO q2 is a necessary but not sufficient condition for a model to have high predictive power, and that this is a general property of QSAR models developed using LOO cross-validation.
Abstract: Validation is a crucial aspect of any quantitative structure-activity relationship (QSAR) modeling. This paper examines one of the most popular validation criteria, leave-one-out cross-validated R2 (LOO q2). Often, a high value of this statistical characteristic (q2 > 0.5) is considered as a proof of the high predictive ability of the model. In this paper, we show that this assumption is generally incorrect. In the case of 3D QSAR, the lack of the correlation between the high LOO q2 and the high predictive ability of a QSAR model has been established earlier [Pharm. Acta Helv. 70 (1995) 149; J. Chemomet. 10(1996)95; J. Med. Chem. 41 (1998) 2553]. In this paper, we use two-dimensional (2D) molecular descriptors and k nearest neighbors (kNN) QSAR method for the analysis of several datasets. No correlation between the values of q2 for the training set and predictive ability for the test set was found for any of the datasets. Thus, the high value of LOO q2 appears to be the necessary but not the sufficient condition for the model to have a high predictive power. We argue that this is the general property of QSAR models developed using LOO cross-validation. We emphasize that the external validation is the only way to establish a reliable QSAR model. We formulate a set of criteria for evaluation of predictive ability of QSAR models.

3,176 citations
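The LOO q2 statistic examined above is defined as 1 minus the leave-one-out prediction error sum (PRESS) over the total sum of squares. As a sketch of the computation, the Python below evaluates it for an ordinary least squares model on synthetic data; this is a simplification for illustration (the paper itself used kNN QSAR models, and all names and data here are assumptions).

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated R^2 (q^2) for ordinary least
    squares: q2 = 1 - PRESS / sum((y - mean(y))^2)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        # Refit the model with sample i held out (intercept + slopes).
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        pred = np.concatenate([[1.0], X[i]]) @ coef
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic, well-determined linear data: q2 comes out close to 1 here,
# but, as the paper shows, that alone does not guarantee external
# predictive ability.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
q2 = loo_q2(X, y)
```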

Journal ArticleDOI
Robert W. Kennard1, L. A. Stone1
TL;DR: A computer oriented method which assists in the construction of response surface type experimental plans takes into account constraints met in practice that standard procedures do not consider explicitly.
Abstract: A computer oriented method which assists in the construction of response surface type experimental plans is described. It takes into account constraints met in practice that standard procedures do not consider explicitly. The method is a sequential one and each step covers the experimental region uniformly. Applications to well-known situations are given to demonstrate the reasonableness of the procedure. Application to a 'messy' design situation is given to demonstrate its novelty.

2,667 citations
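The sequential, uniformly space-covering selection described above (the Kennard-Stone algorithm, as the data-splitting entry earlier calls it) is compact enough to sketch; the Python below is an illustrative implementation on toy data, with the function name and example points our own assumptions.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Pick n_select rows of X that cover the space uniformly: start
    from the pair of points farthest apart, then repeatedly add the
    point whose minimum distance to the selected set is largest."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # Minimum distance from each remaining point to the selected set.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

# Four corners of a square plus its centre: K-S picks the corners first,
# leaving the centre point (index 4) out.
X = np.array([[0., 0.], [0., 10.], [10., 0.], [10., 10.], [5., 5.]])
chosen = sorted(kennard_stone(X, 4))
```

This also illustrates the data-splitting caveat raised earlier: the most space-covering samples go into the selected (training) set, so the leftover points are not representative.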

Journal ArticleDOI
TL;DR: In this paper, the authors develop the mathematical and statistical structure of PLS regression and analyze the PLS decomposition of the data matrices involved in model building, showing how the PLS regression algorithm can be interpreted in a model-building setting.
Abstract: In this paper we develop the mathematical and statistical structure of PLS regression. We show the PLS regression algorithm and how it can be interpreted in model building. The basic mathematical principles that lie behind two-block PLS are depicted. We also show the statistical aspects of the PLS method when it is used for model building. Finally, we show the structure of the PLS decompositions of the data matrices involved.

1,778 citations

Journal ArticleDOI
TL;DR: The principles of the Kohonen and counterpropagation artificial neural network (K-ANN and CP-ANN) learning strategies are described, and the use of both methods is explained with several examples from analytical chemistry.

250 citations

Journal ArticleDOI
TL;DR: This method, called Probabilistic Discriminant Partial Least Squares (p-DPLS), integrates DPLS, density methods and Bayes decision theory in order to take into account the uncertainty of the predictions in DPLS.

142 citations