Journal ArticleDOI

Classification tools in chemistry. Part 1: linear models. PLS-DA

26 Jul 2013-Analytical Methods (The Royal Society of Chemistry)-Vol. 5, Iss: 16, pp 3790-3798
TL;DR: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial, and issues to be evaluated during model training and validation are introduced and explained using a chemical dataset.
Abstract: The common steps to calibrate and validate classification models based on partial least squares discriminant analysis are discussed in the present tutorial. All issues to be evaluated during model training and validation are introduced and explained using a chemical dataset, composed of toxic and non-toxic sediment samples. The analysis was carried out with MATLAB routines, which are available in the ESI of this tutorial, together with the dataset and a detailed list of all MATLAB instructions used for the analysis.
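The tutorial's routines are MATLAB code supplied in the ESI; as a rough orientation, the core PLS-DA idea (fit a PLS1 regression to 0/1 class labels, then threshold the prediction) can be sketched in Python. Everything below is an illustrative assumption, not the tutorial's own code: the bare-bones NIPALS loop, the 0.5 decision threshold, and the synthetic two-class data standing in for the toxic/non-toxic sediment samples.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Bare-bones NIPALS PLS1 on mean-centred data; returns the
    regression coefficient vector plus the centring terms."""
    Xc, yc = X - X.mean(0), y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc                 # weight vector from X-y covariance
        w /= np.linalg.norm(w)
        t = Xc @ w                    # scores
        tt = t @ t
        p = Xc.T @ t / tt             # X loadings
        q = yc @ t / tt               # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - q * t               # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    B = W @ np.linalg.solve(P.T @ W, Q)   # coefficients in original X space
    return B, X.mean(0), y.mean()

def pls_da_predict(X, B, x_mean, y_mean):
    # Threshold the continuous PLS prediction at 0.5 (classes coded 0/1).
    return ((X - x_mean) @ B + y_mean > 0.5).astype(int)

# Synthetic two-class data: two shifted Gaussian clouds.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(2, 1, (30, 8))])
y = np.array([0] * 30 + [1] * 30)
B, xm, ym = pls1_fit(X, y, n_comp=2)
acc = (pls_da_predict(X, B, xm, ym) == y).mean()
```

Note this reports accuracy on the training samples only; as the tutorial stresses, a real model must be validated on samples held out of calibration.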
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors review applications of NIR spectroscopy in the food processing industry, focusing on studies dealing with the on-line application of NIR spectroscopy to industrial processes in the food industry, categorised according to their application conditions into semi-industrial scale and industrial scale.
Abstract: Near infrared (NIR) spectroscopy is an emerging analytical technique enjoying increasing popularity in the food processing industry due to its low running costs and the fact that it requires no sample preparation. Moreover, it is a non-destructive, environmentally friendly, rapid technique capable of on-line application. This technique is therefore well suited for implementation as an analytical tool in industrial processing. The different fields of application of NIR spectroscopy reported in the present review highlight its enormous versatility. Quantitative analyses of chemical constituents using this methodology are widespread. Moreover, a wide range of qualitative determinations have been reported, e.g. for authenticity control, sample discrimination, and the assessment of sensory, rheological, technological and physical attributes. Both animal- and plant-derived foodstuffs have been evaluated in this context. Highly diverse matrices such as intact solid samples, free-flowing solids, and pasty and fluid samples can be analysed by NIR spectroscopy. Sophisticated conditions for application at industrial scale comprise, among others, measurements on moving conveyor belts, in continuous flows in tubes, and the monitoring of fermentation processes. For such purposes, different designs of NIR spectrometers have been developed: hyperspectral imaging systems, portable devices, fibre-optical and direct-contact probes, tube-integrated probes measuring through windows, and automated sample cell loading. In the present review, emphasis is put on studies dealing with the on-line application of NIR spectroscopy to industrial processes in the food industry, which are categorised according to their application conditions into semi-industrial scale and industrial scale.

394 citations

Journal ArticleDOI
01 Jan 2018
TL;DR: A comparative study of various reported data splitting methods found that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimation of model performance.
Abstract: Model validation is the most important part of building a supervised model. For building a model with good generalization performance one must have a sensible data splitting strategy, and this is crucial for model validation. In this study, we conducted a comparative study of various reported data splitting methods. The MixSim model was employed to generate nine simulated datasets with different probabilities of mis-classification and variable sample sizes. Then partial least squares for discriminant analysis and support vector machines for classification were applied to these datasets. Data splitting methods tested included variants of cross-validation, bootstrapping, bootstrapped Latin partition, the Kennard-Stone algorithm (K-S) and sample set partitioning based on joint X–Y distances (SPXY). These methods were employed to split the data into training and validation sets. The generalization performances estimated from the validation sets were then compared with those obtained from blind test sets, which were generated from the same distribution but were unseen by the training/validation procedure used in model construction. The results showed that the size of the data is the deciding factor for the quality of the generalization performance estimated from the validation set. We found a significant gap between the performance estimated from the validation set and that from the test set for all the data splitting methods employed on small datasets. This disparity decreased when more samples were available for training/validation, because the models then moved towards the approximation described by the central limit theorem for the simulated datasets used.
We also found that having too many or too few samples in the training set had a negative effect on the estimated model performance, suggesting that a good balance between the sizes of the training and validation sets is necessary for a reliable estimation of model performance. We also found that systematic sampling methods such as K-S and SPXY generally gave very poor estimates of model performance, most likely because they are designed to take the most representative samples first and thus leave a rather unrepresentative sample set for model performance estimation.

380 citations
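The validation-set estimates compared in the study above all start from some split of the data; as a minimal concrete example, here is a k-fold cross-validation sketch in Python. The toy nearest-centroid classifier, the simulated two-class data and all names are our own illustrative assumptions (the paper itself used MixSim-generated data with PLS-DA and SVM classifiers).

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle sample indices and cut them into k roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def nearest_mean_acc(X_tr, y_tr, X_va, y_va):
    # Tiny classifier: assign each sample to the nearer class centroid.
    m0, m1 = X_tr[y_tr == 0].mean(0), X_tr[y_tr == 1].mean(0)
    pred = (np.linalg.norm(X_va - m1, axis=1)
            < np.linalg.norm(X_va - m0, axis=1)).astype(int)
    return (pred == y_va).mean()

# Simulated two-class data: two shifted Gaussian clouds.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1.5, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

folds = kfold_indices(len(y), 5, rng)
accs = []
for i, va in enumerate(folds):
    tr = np.concatenate([f for j, f in enumerate(folds) if j != i])
    accs.append(nearest_mean_acc(X[tr], y[tr], X[va], y[va]))
cv_estimate = np.mean(accs)
```

The paper's point is that `cv_estimate` can differ appreciably from accuracy on a truly blind test set, especially when the dataset is small.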

Journal ArticleDOI
23 Jul 2018-Analyst
TL;DR: The aim of the article is to review, outline and describe the contemporary PLS-DA modelling practice strategies, and to critically discuss the respective knowledge gaps that have emerged in response to the present big data era.
Abstract: Partial least squares-discriminant analysis (PLS-DA) is a versatile algorithm that can be used for predictive and descriptive modelling as well as for discriminative variable selection. However, versatility is both a blessing and a curse and the user needs to optimize a wealth of parameters before reaching reliable and valid outcomes. Over the past two decades, PLS-DA has demonstrated great success in modelling high-dimensional datasets for diverse purposes, e.g. product authentication in food analysis, diseases classification in medical diagnosis, and evidence analysis in forensic science. Despite that, in practice, many users have yet to grasp the essence of constructing a valid and reliable PLS-DA model. As the technology progresses, across every discipline, datasets are evolving into a more complex form, i.e. multi-class, imbalanced and colossal. Indeed, the community is welcoming a new era called big data. In this context, the aim of the article is two-fold: (a) to review, outline and describe the contemporary PLS-DA modelling practice strategies, and (b) to critically discuss the respective knowledge gaps that have emerged in response to the present big data era. This work could complement other available reviews or tutorials on PLS-DA, to provide a timely and user-friendly guide to researchers, especially those working in applied research.

357 citations

Journal ArticleDOI
TL;DR: This review describes and compares the most widely used multivariate statistical techniques including exploratory, interpretive and discriminatory procedures, and presents examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world.
Abstract: Recent advances in high-throughput methods of molecular analysis have led to an explosion of studies generating large-scale ecological data sets. Particularly notable progress has been made in the field of microbial ecology, where new experimental approaches have provided in-depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high-throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques, including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure.

314 citations


Additional excerpts

  • "...described in Ballabio & Consonni (2013)..."


Journal ArticleDOI
TL;DR: In this study, different global measures of classification performances are compared by means of results achieved on an extended set of real multivariate datasets and a set of benchmark values based on different random classification scenarios are introduced.

173 citations

References
Journal ArticleDOI
TL;DR: It is argued that a high value of LOO q2 is a necessary but not sufficient condition for a model to have high predictive power, and that this is a general property of QSAR models developed using LOO cross-validation.
Abstract: Validation is a crucial aspect of any quantitative structure-activity relationship (QSAR) modeling. This paper examines one of the most popular validation criteria, leave-one-out cross-validated R2 (LOO q2). Often, a high value of this statistical characteristic (q2 > 0.5) is considered as a proof of the high predictive ability of the model. In this paper, we show that this assumption is generally incorrect. In the case of 3D QSAR, the lack of the correlation between the high LOO q2 and the high predictive ability of a QSAR model has been established earlier [Pharm. Acta Helv. 70 (1995) 149; J. Chemomet. 10(1996)95; J. Med. Chem. 41 (1998) 2553]. In this paper, we use two-dimensional (2D) molecular descriptors and k nearest neighbors (kNN) QSAR method for the analysis of several datasets. No correlation between the values of q2 for the training set and predictive ability for the test set was found for any of the datasets. Thus, the high value of LOO q2 appears to be the necessary but not the sufficient condition for the model to have a high predictive power. We argue that this is the general property of QSAR models developed using LOO cross-validation. We emphasize that the external validation is the only way to establish a reliable QSAR model. We formulate a set of criteria for evaluation of predictive ability of QSAR models.

3,176 citations
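The LOO q2 statistic examined above is defined as 1 minus the leave-one-out prediction error sum (PRESS) over the total sum of squares. As a sketch of the computation, the Python below evaluates it for an ordinary least squares model on synthetic data; this is a simplification for illustration (the paper itself used kNN QSAR models, and all names and data here are assumptions).

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out cross-validated R^2 (q^2) for ordinary least
    squares: q2 = 1 - PRESS / sum((y - mean(y))^2)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        # Refit the model with sample i held out (intercept + slopes).
        A = np.column_stack([np.ones(mask.sum()), X[mask]])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        pred = np.concatenate([[1.0], X[i]]) @ coef
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic, well-determined linear data: q2 comes out close to 1 here,
# but, as the paper shows, that alone does not guarantee external
# predictive ability.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=40)
q2 = loo_q2(X, y)
```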

Journal ArticleDOI
Robert W. Kennard1, L. A. Stone1
TL;DR: A computer oriented method which assists in the construction of response surface type experimental plans takes into account constraints met in practice that standard procedures do not consider explicitly.
Abstract: A computer oriented method which assists in the construction of response surface type experimental plans is described. It takes into account constraints met in practice that standard procedures do not consider explicitly. The method is a sequential one and each step covers the experimental region uniformly. Applications to well-known situations are given to demonstrate the reasonableness of the procedure. Application to a 'messy' design situation is given to demonstrate its novelty.

2,667 citations
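The sequential, uniformly space-covering selection described above (the Kennard-Stone algorithm, as the data-splitting entry earlier calls it) is compact enough to sketch; the Python below is an illustrative implementation on toy data, with the function name and example points our own assumptions.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Pick n_select rows of X that cover the space uniformly: start
    from the pair of points farthest apart, then repeatedly add the
    point whose minimum distance to the selected set is largest."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # Minimum distance from each remaining point to the selected set.
        min_d = d[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(min_d))])
    return selected

# Four corners of a square plus its centre: K-S picks the corners first,
# leaving the centre point (index 4) out.
X = np.array([[0., 0.], [0., 10.], [10., 0.], [10., 10.], [5., 5.]])
chosen = sorted(kennard_stone(X, 4))
```

This also illustrates the data-splitting caveat raised earlier: the most space-covering samples go into the selected (training) set, so the leftover points are not representative.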

Journal ArticleDOI
TL;DR: In this paper, the authors develop the mathematical and statistical structure of PLS regression and analyze the PLS decomposition of the data matrices involved in model building, showing how the PLS regression algorithm can be interpreted in a model-building setting.
Abstract: In this paper we develop the mathematical and statistical structure of PLS regression. We show the PLS regression algorithm and how it can be interpreted in model building. The basic mathematical principles that lie behind two-block PLS are depicted. We also show the statistical aspects of the PLS method when it is used for model building. Finally, we show the structure of the PLS decompositions of the data matrices involved.

1,778 citations

Journal ArticleDOI
TL;DR: The principles of the Kohonen and counterpropagation artificial neural network (K-ANN and CP-ANN) learning strategies are described, and the use of both methods is explained with several examples from analytical chemistry.

250 citations

Journal ArticleDOI
TL;DR: This method, called Probabilistic Discriminant Partial Least Squares (p-DPLS), integrates DPLS, density methods and Bayes decision theory in order to take into account the uncertainty of the predictions in DPLS.

142 citations