
Showing papers in "Journal of Chemometrics in 2003"


Journal ArticleDOI
TL;DR: It is demonstrated that the reconstruction-based framework provides a convenient way for fault analysis, including fault detectability, reconstructability and identifiability conditions, resolving many theoretical issues in process monitoring.
Abstract: This paper provides an overview and analysis of statistical process monitoring methods for fault detection, identification and reconstruction. Several fault detection indices in the literature are analyzed and unified. Fault reconstruction for both sensor and process faults is presented which extends the traditional missing value replacement method. Fault diagnosis methods that have appeared recently are reviewed. The reconstruction-based approach and the contribution-based approach are analyzed and compared with simulation and industrial examples. The complementary nature of the reconstruction- and contribution-based approaches is highlighted. An industrial example of polyester film process monitoring is given to demonstrate the power of the contribution- and reconstruction-based approaches in a hierarchical monitoring framework. Finally we demonstrate that the reconstruction-based framework provides a convenient way for fault analysis, including fault detectability, reconstructability and identifiability conditions, resolving many theoretical issues in process monitoring. Additional topics are summarized at the end of the paper for future investigation. Copyright © 2003 John Wiley & Sons, Ltd.
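
As a minimal sketch of two of the detection indices the paper unifies (not the authors' reconstruction framework), a PCA model of normal operation yields Hotelling's T² in the model subspace and the squared prediction error (SPE, or Q) in the residual subspace; the data and component count below are illustrative.

```python
import numpy as np

def pca_monitoring_indices(X_train, x_new, n_components):
    """Hotelling's T^2 and SPE (Q) for one new sample, given normal-operation
    training data; a simplified sketch of two common fault detection indices."""
    mu = X_train.mean(axis=0)
    Xc = X_train - mu
    # PCA via SVD of the centered training data
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                          # loadings (variables x components)
    lam = s[:n_components] ** 2 / (Xc.shape[0] - 1)  # score variances
    x = x_new - mu
    t = P.T @ x                                      # scores of the new sample
    T2 = float(np.sum(t ** 2 / lam))                 # distance within the model subspace
    r = x - P @ t                                    # residual in the complementary subspace
    SPE = float(r @ r)                               # Q statistic
    return T2, SPE

# Example: monitor a sample drawn from the same distribution as the training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))
T2, SPE = pca_monitoring_indices(X_train, X_train[0], n_components=3)
```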

1,408 citations


Journal ArticleDOI
TL;DR: The core consistency diagnostic (CORCONDIA), as discussed by the authors, determines the appropriate number of components for multiway models by scrutinizing the appropriateness of the structural model given the data and the estimated parameters.
Abstract: A new diagnostic called the core consistency diagnostic (CORCONDIA) is suggested for determining the proper number of components for multiway models. It applies especially to the parallel factor analysis (PARAFAC) model, but also to other models that can be considered as restricted Tucker3 models. It is based on scrutinizing the ‘appropriateness’ of the structural model based on the data and the estimated parameters of gradually augmented models. A PARAFAC model (employing dimension-wise combinations of components for all modes) is called appropriate if adding other combinations of the same components does not improve the fit considerably. It is proposed to choose the largest model that is still sufficiently appropriate. Using examples from a range of different types of data, it is shown that the core consistency diagnostic is an effective tool for determining the appropriate number of components in e.g. PARAFAC models. However, it is also shown, using simulated data, that the theoretical understanding of CORCONDIA is not yet complete. Copyright © 2003 John Wiley & Sons, Ltd.
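
A rough numerical sketch of the diagnostic, assuming PARAFAC loadings are already available: the unconstrained Tucker3 core for those loadings is estimated by least squares (a direct solve that is only practical for small arrays, and not the authors' implementation) and compared with the ideal superdiagonal core.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (in %) of a PARAFAC model with loadings A (IxF),
    B (JxF), C (KxF) for the three-way array X (IxJxK)."""
    F = A.shape[1]
    # Least squares Tucker3 core for the fixed loadings: vec(X) = (C kron B kron A) vec(G)
    Z = np.kron(C, np.kron(B, A))
    g, *_ = np.linalg.lstsq(Z, X.reshape(-1, order='F'), rcond=None)
    G = g.reshape((F, F, F), order='F')
    T = np.zeros((F, F, F))
    for f in range(F):
        T[f, f, f] = 1.0                        # ideal superdiagonal core
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / np.sum(T ** 2))

# A perfectly trilinear array modeled with the correct number of components
rng = np.random.default_rng(1)
A, B, C = rng.normal(size=(20, 3)), rng.normal(size=(15, 3)), rng.normal(size=(10, 3))
X = np.einsum('if,jf,kf->ijk', A, B, C)
print(core_consistency(X, A, B, C))             # close to 100 for an appropriate model
```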

1,110 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a dedicated investigation and practical description of how to apply PARAFAC modeling to complicated fluorescence excitation-emission measurements, including choosing the right number of components, handling problems with missing values and scatter, detecting variables influenced by noise and identifying outliers.
Abstract: This paper presents a dedicated investigation and practical description of how to apply PARAFAC modeling to complicated fluorescence excitation-emission measurements. The steps involved in finding the optimal PARAFAC model are described in detail based on the characteristics of fluorescence data. These steps include choosing the right number of components, handling problems with missing values and scatter, detecting variables influenced by noise and identifying outliers. Various validation methods are applied in order to ensure that the optimal model has been found and several common data-specific problems and their solutions are explained. Finally, interpretations of the specific models are given. The paper can be used as a tutorial for investigating fluorescence landscapes with multi-way analysis.

561 citations


Journal ArticleDOI
TL;DR: In this paper, the purpose and use of centering and scaling are discussed in depth, and the results can easily be generalized to multiway data analysis, but the main focus is on two-way bilinear data analysis.
Abstract: In this paper the purpose and use of centering and scaling are discussed in depth. The main focus is on two-way bilinear data analysis, but the results can easily be generalized to multiway data analysis. In fact, one of the scopes of this paper is to show that if two-way centering and scaling are understood, then multiway centering and scaling is quite straightforward. In the literature it is often stated that preprocessing of multiway arrays is difficult, but here it is shown that most of the difficulties do not pertain to three- and higher-way modeling in particular. It is shown that centering is most conveniently seen as a projection step, where the data are projected onto certain well-defined spaces within a given mode. This view of centering helps to explain why, for example, centering data with missing elements is likely to be suboptimal if there are many missing elements. Building a model for data consists of two parts: postulating a structural model and using a method to estimate the parameters. Centering has to do with the first part: when centering, a model including offsets is postulated. Scaling has to do with the second part: when scaling, another way of fitting the model is employed. It is shown that centering is simply a convenient technique to estimate model parameters for models with certain offsets, but this does not work for all types of offsets. It is also shown that scaling is a way to fit models with a weighted least squares loss function and that sometimes this change in objective function cannot be performed by a simple scaling step. Further practical aspects of and alternatives to centering and scaling are discussed, and examples are used throughout to show that the conclusions in the paper are not only of theoretical interest but can have an impact on practical data analysis. Copyright © 2003 John Wiley & Sons, Ltd.
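
A small numpy illustration of the two viewpoints described above: centering across a mode as a projection onto the space orthogonal to a vector of ones, and scaling within a mode as the weights of an implicit weighted least squares fit. The data and weights are purely synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5)) + 3.0           # two-way data with column offsets

# Centering across the first mode = projecting each column onto the space
# orthogonal to a vector of ones (removes one offset per column).
ones = np.ones((X.shape[0], 1))
P = np.eye(X.shape[0]) - ones @ ones.T / X.shape[0]
Xc = P @ X                                    # identical to X - X.mean(axis=0)

# Scaling within the second mode = one weight per column; fitting a model to
# Xs amounts to a weighted least squares fit to the unscaled data.
w = 1.0 / Xc.std(axis=0, ddof=1)
Xs = Xc * w
```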

373 citations


Journal ArticleDOI
TL;DR: The O2-PLS method as mentioned in this paper is derived from the basic partial least squares projections to latent structures (PLS) prediction approach and has an integral orthogonal signal correction (OSC) filter that separates structured noise in X and Y from their joint X-Y covariation used in the prediction model.
Abstract: The O2-PLS method is derived from the basic partial least squares projections to latent structures (PLS) prediction approach. The importance of the covariation matrix (YX) is pointed out in relation to both the prediction model and the structured noise in both X and Y. Structured noise in X (or Y) is defined as the systematic variation of X (or Y) not linearly correlated with Y (or X). Examples in spectroscopy include baseline, drift and scatter effects. If structured noise is present in X, the existing latent variable regression (LVR) methods, e.g. PLS, will have weakened score-loading correspondence beyond the first component. This negatively affects the interpretation of model parameters such as scores and loadings. The O2-PLS method models and predicts both X and Y and has an integral orthogonal signal correction (OSC) filter that separates the structured noise in X and Y from their joint X-Y covariation used in the prediction model. This leads to a minimal number of predictive components with full score-loading correspondence and also an opportunity to interpret the structured noise. In both a real and a simulated example, O2-PLS and PLS gave very similar predictions of Y. However, the interpretation of the prediction models was clearly improved with O2-PLS, because structured noise was present. In the NIR example, O2-PLS revealed a strong water peak and baseline offset in the structured noise components. In the simulated example the O2-PLS plot of observed versus predicted Y-scores (u vs U) showed good predictions. The corresponding loading vectors provided good interpretation of the covarying analytes in X and Y.

326 citations


Journal ArticleDOI
TL;DR: In this paper, the authors review recent improvements in the factor analysis methods applied in receptor modeling of airborne particulate matter, as well as the easier application of trajectory methods for locating pollution sources.
Abstract: Receptor modeling is the application of data analysis methods to elicit information on the sources of air pollutants. Typically, it employs methods of solving the mixture resolution problem using chemical composition data for airborne particulate matter samples. In such cases, the outcome is the identification of the pollution source types and estimates of the contribution of each source type to the observed concentrations. It can also involve efforts to identify the locations of the sources through the use of ensembles of air parcel back trajectories. In recent years, there have been improvements in the factor analysis methods that are applied in receptor modeling as well as easier application of trajectory methods. These developments are reviewed. Copyright © 2003 John Wiley & Sons, Ltd.

292 citations


Journal ArticleDOI
TL;DR: This paper first takes a critical look at the true nature of batch process data, then some of the methods that have appeared in the literature are examined as to their assumptions, their advantages and disadvantages and their range of applicability.
Abstract: There has been a lot of research activity in the area of batch process analysis and monitoring for abnormal situation detection since the pioneering work of Nomikos and MacGregor [1–5]. However, some of the key ideas and the thought process that led to those first papers have been forgotten. Batch process data are dynamic data. The whole philosophy of looking at batch process data with latent variables was developed because batch process variables are both autocorrelated and cross-correlated. Statistical process control by definition checks deviations from a nominal behavior (a target). Therefore for statistical process control of batch processes we should look at deviations of process variable trajectories from their nominal trajectories and from their nominal auto/cross-correlations. An added advantage to modeling the deviations from the target trajectory is that a non-linear problem is converted to a linear one that is easy to tackle with linear latent variable methods such as principal component analysis (PCA) and partial least squares (PLS). This paper first takes a critical look at the true nature of batch process data. The general case where variables are not present during the entire duration of the batch is addressed. It is then illustrated how proper centering (by taking the deviations from the target trajectory) can retain valuable information on auto- and cross-correlation of the process variables. This auto- and cross-correlation is only modeled with certain types of models. Topics such as scaling and trajectory alignment are revisited and issues arising when using the indicator variable approach are addressed. The development of control charts for multiblock, multiway PCA/PLS is discussed. Practical issues related to applications in industry are addressed. Then some of the methods that have appeared in the literature are examined as to their assumptions, their advantages and disadvantages and their range of applicability. Finally the nature of transition data (start-ups, grade transitions) is discussed and issues related to aligning, centering and scaling such types of data are presented. Copyright © 2003 John Wiley & Sons, Ltd.
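
A toy illustration of the centering step advocated above, with synthetic data and batch-wise unfolding in the style of Nomikos and MacGregor; it is a sketch of the idea, not the paper's own procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_batches, n_vars, n_time = 30, 5, 100

# Synthetic batch trajectories: a common nominal profile plus noise
nominal = np.sin(np.linspace(0, 3, n_time))
X = nominal + 0.1 * rng.normal(size=(n_batches, n_vars, n_time))

# Centering about the nominal (mean) trajectory of every variable turns the
# data into deviations from target, which is what is monitored statistically.
mean_trajectory = X.mean(axis=0)              # (variables x time) target trajectories
X_dev = X - mean_trajectory

# Batch-wise unfolding: one row per batch, ready for a PCA/PLS monitoring model
X_unfolded = X_dev.reshape(n_batches, -1)
```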

265 citations


Journal ArticleDOI
TL;DR: In this article, the authors introduce robustified versions of the SIMPLS algorithm, which are constructed from a robust covariance matrix for high-dimensional data and robust linear regression, and introduce robust RMSECV and RMSEP values for model calibration and model validation.
Abstract: Partial least squares regression (PLSR) is a linear regression technique developed to deal with high-dimensional regressors and one or several response variables. In this paper we introduce robustified versions of the SIMPLS algorithm, this being the leading PLSR algorithm because of its speed and efficiency. Because SIMPLS is based on the empirical cross-covariance matrix between the response variables and the regressors and on linear least squares regression, the results are affected by abnormal observations in the data set. Two robust methods, RSIMCD and RSIMPLS, are constructed from a robust covariance matrix for high-dimensional data and robust linear regression. We introduce robust RMSECV and RMSEP values for model calibration and model validation. Diagnostic plots are constructed to visualize and classify the outliers. Several simulation results and the analysis of real data sets show the effectiveness and robustness of the new approaches. Because RSIMPLS is roughly twice as fast as RSIMCD, it stands out as the overall best method. Copyright © 2003 John Wiley & Sons, Ltd.
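
For reference, a compact numpy sketch of the classical SIMPLS algorithm that the paper starts from; RSIMCD and RSIMPLS replace the empirical cross-covariance matrix and the least squares regression steps marked below with robust estimates, which this sketch does not do.

```python
import numpy as np

def simpls(X, Y, n_components):
    """Classical SIMPLS (de Jong); X is (n x p), Y is (n x q)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    p, q = Xc.shape[1], Yc.shape[1]
    R = np.zeros((p, n_components))       # X weights
    V = np.zeros((p, n_components))       # orthonormal basis of X loadings
    Q = np.zeros((q, n_components))       # Y loadings
    S = Xc.T @ Yc                         # empirical cross-covariance (robustified in RSIMPLS)
    for a in range(n_components):
        u, _, _ = np.linalg.svd(S, full_matrices=False)
        r = u[:, 0]                       # dominant direction of the cross-covariance
        t = Xc @ r
        t -= t.mean()
        norm_t = np.linalg.norm(t)
        t, r = t / norm_t, r / norm_t
        p_load = Xc.T @ t                 # least squares loading step (robustified in RSIMPLS)
        q_load = Yc.T @ t
        v = p_load - V[:, :a] @ (V[:, :a].T @ p_load) if a > 0 else p_load.copy()
        v /= np.linalg.norm(v)
        S -= np.outer(v, v @ S)           # deflate the cross-covariance
        R[:, a], V[:, a], Q[:, a] = r, v, q_load
    B = R @ Q.T                           # regression coefficients for centered data
    return B, x_mean, y_mean              # predict with (X_new - x_mean) @ B + y_mean
```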

238 citations


Journal ArticleDOI
TL;DR: In this paper, a framework is provided for sequential multiblock methods, including hierarchical PCA (HPCA; two versions), consensus PCA (CPCA; two versions) and generalized PCA (GPCA).
Abstract: Multiblock or multiset methods are starting to be used in chemistry and biology to study complex data sets. In chemometrics, sequential multiblock methods are popular; that is, methods that calculate one component at a time and use deflation for finding the next component. In this paper a framework is provided for sequential multiblock methods, including hierarchical PCA (HPCA; two versions), consensus PCA (CPCA; two versions) and generalized PCA (GPCA). Properties of the methods are derived and characteristics of the methods are discussed. All this is illustrated with a real five-block example from chromatography. The only methods with clear optimization criteria are GPCA and one version of CPCA. Of these, GPCA is shown to give inferior results compared with CPCA. Copyright © 2003 John Wiley & Sons, Ltd.

206 citations


Journal ArticleDOI
TL;DR: In this article, a data pre-processing method is presented for multichannel spectra from process spectrophotometers and other multi-channel instruments, which is seen as a "pre-whitening" of the spectra, and serves to make the instrument "blind" to certain interferants while retaining its analyte sensitivity.
Abstract: A data pre-processing method is presented for multichannel ‘spectra’ from process spectrophotometers and other multichannel instruments. It may be seen as a ‘pre-whitening’ of the spectra, and serves to make the instrument ‘blind’ to certain interferants while retaining its analyte sensitivity. Thereby the instrument selectivity may be improved already prior to multivariate calibration. The result is a reduced need for process perturbation or sample spiking just to generate calibration samples that span the unwanted interferants. The method consists of shrinking the multidimensional data space of the spectra in the off-axis dimensions corresponding to the spectra of these interferants. A ‘nuisance’ covariance matrix Σ is first constructed, based on prior knowledge or estimates of the major interferants' spectra, and the scaling matrix G = Σ^(−1/2) is defined. The pre-processing then consists of multiplying each input spectrum by G. When these scaled spectra are analysed in conventional chemometrics software by PCA, PCR, PLSR, curve resolution, etc., the modelling becomes simpler, because it does not have to account for variations in the unwanted interferants. The obtained model parameters may finally be descaled by G^(−1) for graphical interpretation. The pre-processing method is illustrated by the use of prior spectroscopic knowledge to simplify the multivariate calibration of a fibre optical vis/NIR process analyser. The 48-dimensional spectral space, corresponding to the 48 instrument wavelength channels used, is shrunk in two of its dimensions, defined by the known spectra of two major interferants. Successful multivariate calibration could then be obtained, based on a very small calibration sample set. Then the paper shows the pre-whitening used for reducing the number of bilinear PLSR components in multivariate calibration models. Nuisance covariance Σ is either based on the prior knowledge of interferants' spectra or based on estimating the interferants' spectral subspace from the calibration data at hand. The relationship of the pre-processing to weighted and generalized least squares from classical statistics is outlined. Copyright © 2003 John Wiley & Sons, Ltd.
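
A minimal sketch of the pre-whitening step as described: build the nuisance covariance Σ from the interferant spectra, form G = Σ^(−1/2) and multiply each spectrum by G. The interferant spectra and the small ridge term that keeps Σ invertible are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_wavelengths = 48
# Hypothetical known spectra of two major interferants (one per row)
K = rng.normal(size=(2, n_wavelengths))

# Nuisance covariance from the interferant spectra, plus a small ridge so that
# Sigma is invertible outside the interferant subspace (an assumption here).
Sigma = K.T @ K + 1e-3 * np.eye(n_wavelengths)

# G = Sigma^(-1/2) via eigendecomposition; pre-whitening multiplies each
# spectrum by G, shrinking the interferant directions of the data space.
evals, evecs = np.linalg.eigh(Sigma)
G = evecs @ np.diag(evals ** -0.5) @ evecs.T

X = rng.normal(size=(20, n_wavelengths))      # raw spectra, one per row
X_white = X @ G                               # scaled spectra for PCA/PCR/PLSR etc.
G_inv = evecs @ np.diag(evals ** 0.5) @ evecs.T   # descale model parameters with G^(-1)
```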

99 citations


Journal ArticleDOI
TL;DR: The algorithm can align peaks up to the spectral resolution of the acquiring instrument without restrictions on the number of peaks that are to be shifted, and the speed and moderate number of meta‐parameters of the algorithm make the method suitable for on‐line implementation.
Abstract: This thesis is based on five papers addressing variance reduction in different ways. The papers have in common that they all present new numerical methods. Paper I investigates quantitative structure-retention relationships from an image processing perspective, using an artificial neural network to preprocess three-dimensional structural descriptions of the studied steroid molecules. Paper II presents a new method for computing free energies. Free energy is the quantity that determines chemical equilibria and partition coefficients. The proposed method may be used for estimating, e.g., chromatographic retention without performing experiments. Two papers (III and IV) deal with correcting deviations from bilinearity by so-called peak alignment. Bilinearity is a theoretical assumption about the distribution of instrumental data that is often violated by measured data. Deviations from bilinearity lead to increased variance, both in the data and in inferences from the data, unless invariance to the deviations is built into the model, e.g., by the use of the method proposed in paper III and extended in paper IV. Paper V addresses a generic problem in classification; namely, how to measure the goodness of different data representations, so that the best classifier may be constructed. Variance reduction is one of the pillars on which analytical chemistry rests. This thesis considers two aspects of variance reduction: before and after experiments are performed. Before experimenting, theoretical predictions of experimental outcomes may be used to direct which experiments to perform, and how to perform them (papers I and II). After experiments are performed, the variance of inferences from the measured data is affected by the method of data analysis (papers III-V).

Journal ArticleDOI
TL;DR: In this article, the authors propose a robust principal component regression (RPCR) method for the multivariate calibration model; whereas classical PCR combines principal component analysis (PCA) on the regressors with least squares regression, RPCR replaces both stages with robust counterparts.
Abstract: We consider the multivariate calibration model which assumes that the concentrations of several constituents of a sample are linearly related to its spectrum. Principal component regression (PCR) is widely used for the estimation of the regression parameters in this model. In the classical approach it combines principal component analysis (PCA) on the regressors with least squares regression. However, both stages yield very unreliable results when the data set contains outlying observations. We present a robust PCR (RPCR) method which also consists of two parts. First we apply a robust PCA method for high-dimensional data on the regressors, then we regress the response variables on the scores using a robust regression method. A robust RMSECV value and a robust R² value are proposed as exploratory tools to select the number of principal components. The prediction error is also estimated in a robust way. Moreover, we introduce several diagnostic plots which are helpful to visualize and classify the outliers. The robustness of RPCR is demonstrated through simulations and the analysis of a real data set.

Journal ArticleDOI
TL;DR: Shifted factor models as discussed by the authors have been proposed to deal with the problem of factor shifts in sequential data, where the profiles of the latent factors shift position up or down the sequence of measurements: such shifts disturb multilinearity and so standard factor/component models no longer apply.
Abstract: The factor model is modified to deal with the problem of factor shifts. This problem arises with sequential data (e.g. time series, spectra, digitized images) if the profiles of the latent factors shift position up or down the sequence of measurements: such shifts disturb multilinearity and so standard factor/component models no longer apply. To deal with this, we modify the model(s) to include explicit mathematical representation of any factor shifts present in a data set; in this way the model can both adjust for the shifts and describe/recover their patterns. Shifted factor versions of both two- and three (or higher)-way factor models are developed. The results of applying them to synthetic data support the theoretical argument that these models have stronger uniqueness properties; they can provide unique solutions in both two-way and three-way cases where equivalent non-shifted versions are under-identified. For uniqueness to hold, however, the factors must shift independently; two or more factors that show the same pattern of shifts will not be uniquely resolved if not already uniquely determined. Another important restriction is that the models, in their current form, do not work well when the shifts are accompanied by substantial changes in factor profile shape. Three-way factor models such as Parafac, and shifted factor models such as described here, may be just two of many ways that factor analysis can incorporate additional information to make the parameters identifiable. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a new procedure is presented for wavelength interval selection with a genetic algorithm in order to improve the predictive ability of partial least squares multivariate calibration, which involves separately labelling each of the selected sensor ranges with an appropriate inclusion ranking.
Abstract: A new procedure is presented for wavelength interval selection with a genetic algorithm in order to improve the predictive ability of partial least squares multivariate calibration. It involves separately labelling each of the selected sensor ranges with an appropriate inclusion ranking. The new approach intends to alleviate overfitting without the need of preparing an independent monitoring sample set. A theoretical example is worked out in order to compare the performance of the new approach with previous implementations of genetic algorithms. Two experimental data sets are also studied: target parameters are the concentration of glucuronic acid in complex mixtures studied by Fourier transform mid-infrared spectroscopy and the octane number in gasolines monitored by near-infrared spectroscopy. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a methodology is presented where the unknown observations are calculated as a weighted combination of the scores up to the current time point in the new batch and those previously computed from a reference data set.
Abstract: For assured through-batch process performance monitoring, a number of established bilinear and trilinear modelling techniques require data to be available for the entire duration of the batch to realize the on-line application of the nominal model. Various strategies have been proposed for the in-filling of those yet unknown values. A methodology is presented where the unknown observations are calculated as a weighted combination of the scores up to the current time point in the new batch and those previously computed from a reference data set. This approach is investigated for the trilinear technique of parallel factor analysis (PARAFAC). Modified confidence limits are then proposed for the bivariate scores plot for on-line monitoring with a PARAFAC model. The identification of those variables indicative of causing changes in process operation has been accomplished through the application of contribution plots. Based on such plots, a methodology, with associated confidence limits, is proposed for the location of those variables whose behaviour differs from that encapsulated within the reference data set. The approach is demonstrated and compared with existing techniques on a benchmark simulation of a semi-batch emulsion polymerization that has been used in similar studies. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this article, a theory for the use of the net analyte signal vector in inverse regression is developed, where the responses of all pure analytes and interferents are assumed to be known.
Abstract: The net analyte signal and the net analyte signal vector are useful measures in building and optimizing multivariate calibration models. In this paper a theory for their use in inverse regression is developed. The theory of net analyte signal was originally derived from classical least squares in spectral calibration where the responses of all pure analytes and interferents are assumed to be known. However, in chemometrics, inverse calibration models such as partial least squares regression are more abundant and several tools for calculating the net analyte signal in inverse regression models have been proposed. These methods yield different results and most do not provide results that are in accordance with the chosen calibration model. In this paper a thorough development of a calibration-specific net analyte signal vector is given. This definition turns out to be almost identical to the one recently suggested by Faber (Anal. Chem. 1998; 70: 5108–5110). A required correction of the net analyte signal in situations with negative predicted responses is also discussed. Copyright © 2004 John Wiley & Sons, Ltd.
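
A small sketch of the inverse-calibration view of the net analyte signal, in the spirit of the Faber definition cited: the NAS vector is the projection of a (centered) spectrum onto the regression vector. The calibration-specific refinements and the correction for negative predicted responses discussed in the paper are not reproduced.

```python
import numpy as np

def net_analyte_signal(x, b):
    """Project a (centered) spectrum x onto the inverse-regression vector b.
    Returns the NAS vector (the part of x the model actually uses) and the
    scalar NAS, which equals the predicted response divided by ||b||."""
    x, b = np.ravel(x), np.ravel(b)
    x_nas = (x @ b) / (b @ b) * b
    nas = (x @ b) / np.linalg.norm(b)
    return x_nas, nas
```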

Journal ArticleDOI
TL;DR: The classical favourites in chemometrics, scatter plots, are looked into more deeply and some criticism based on recent literature references is formulated for situations of principal component analysis, PARAFAC three‐way analysis and regression by partial least squares.
Abstract: In data analysis, many situations arise where plotting and visualization are helpful or an absolute requirement for understanding. There are many techniques of plotting data/parameters/residuals. These have to be understood and visualization has to be made clearly and interpreted correctly. In this paper the classical favourites in chemometrics, scatter plots, are looked into more deeply and some criticism based on recent literature references is formulated for situations of principal component analysis, PARAFAC three-way analysis and regression by partial least squares. Biplots are also afforded some attention. Examples from near-infrared spectroscopy are given as illustrations. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this new approach the prediction errors of individual observations (obtained from cross‐validation) are compared across models incorporating varying numbers of latent variables and non‐parametric statistical methods are used to select the simplest model that provides prediction quality that is indistinguishable from that provided by more complex models.
Abstract: Model selection is an important issue when constructing multivariate calibration models using methods based on latent variables (e.g. partial least squares regression and principal component regression). It is important to select an appropriate number of latent variables to build an accurate and precise calibration model. Inclusion of too few latent variables can result in a model that is inaccurate over the complete space of interest. Inclusion of too many latent variables can result in a model that produces noisy predictions through incorporation of low-order latent variables that have little or no predictive value. Commonly used metrics for selecting the number of latent variables are based on the predicted error sum of squares (PRESS) obtained via cross-validation. In this paper a new approach for selecting the number of latent variables is proposed. In this new approach the prediction errors of individual observations (obtained from cross-validation) are compared across models incorporating varying numbers of latent variables. Based on these comparisons, non-parametric statistical methods are used to select the simplest model (least number of latent variables) that provides prediction quality that is indistinguishable from that provided by more complex models. Unlike methods based on PRESS, this new approach is robust to the effects of anomalous observations. More generally, the same approach can be used to compare the performance of any models that are applied to the same data set where reference values are available. The proposed methodology is illustrated with an industrial example involving the prediction of gasoline octane numbers from near-infrared spectra. Published in 2004 by John Wiley & Sons, Ltd.
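
One plausible instantiation of the selection rule described above, using per-sample cross-validation errors and a Wilcoxon signed-rank test; the specific test, significance level and reference model chosen here are assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.stats import wilcoxon

def select_n_latent_variables(cv_abs_errors, alpha=0.05):
    """cv_abs_errors: (n_samples, max_lv) absolute cross-validation errors for
    models with 1..max_lv latent variables. Pick the smallest model whose
    per-sample errors are not significantly worse than the best model's."""
    medians = np.median(cv_abs_errors, axis=0)
    best = int(np.argmin(medians))               # reference (most accurate) model
    for a in range(cv_abs_errors.shape[1]):
        if a == best:
            return a + 1
        diff = cv_abs_errors[:, a] - cv_abs_errors[:, best]
        # Are the simpler model's errors significantly larger than the best model's?
        _, p = wilcoxon(diff, alternative='greater')
        if p > alpha:                            # indistinguishable -> keep the simpler model
            return a + 1
    return best + 1
```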

Journal ArticleDOI
TL;DR: Tabu Search, a deterministic global optimization technique loosely based on concepts from artificial intelligence, is applied by the authors to wavelength selection, a method that can be used for improving the quality of calibration models.
Abstract: This paper introduces Tabu Search in analytical chemistry by applying it to wavelength selection. Tabu Search is a deterministic global optimization technique loosely based on concepts from artificial intelligence. Wavelength selection is a method which can be used for improving the quality of calibration models. Tabu Search uses basic, problem-specific operators to explore a search space, and memory to keep track of parts already visited. Several implementational aspects of wavelength selection with Tabu Search will be discussed. Two ways of memorizing the search space are investigated: storing the actual solutions and storing the steps necessary to create them. Parameters associated with Tabu Search are configured with a Plackett-Burman design. In addition, two extension schemes for Tabu Search, intensification and diversification, have been implemented and are applied with good results. Eventually, two implementations of wavelength selection with Tabu Search are tested, one which searches for a solution with a constant number of wavelengths and one with a variable number of wavelengths. Both implementations are compared with results obtained by wavelength selection methods based on simulated annealing (SA) and genetic algorithms (GAs). It is demonstrated with three real-world data sets that Tabu Search performs equally well as and can be a valuable alternative to SA and GAs. The improvements in predictive abilities increased by a factor of 20 for data set 1 and by a factor of 2 for data sets 2 and 3. In addition, when the number of wavelengths in a solution is variable, measurements on the coverage of the search space show that the coverage is usually higher for Tabu Search compared with SA and GAs.

Journal ArticleDOI
TL;DR: In this article, the effect of mass balance on the analysis of two-way data of reaction or process systems is investigated and two slightly modified procedures are suggested to extract the spectral subspaces essential for resolution and to ascertain the number of reactions in different time domains.
Abstract: The effect of mass balance on the analysis of two-way data of reaction or process systems is investigated. It is shown that the rank-deficient species-related bilinear model can be converted to a full-rank reaction-related bilinear model, and in general situations the chemical rank for a system is the number of reactions plus one. Two slightly modified procedures are thus suggested to extract the spectral subspaces essential for resolution and to ascertain the number of reactions in different time domains. Based on the reaction-related bilinear model, a procedure of window factor analysis (WFA) is implemented for resolving the extent curves of reactions. A new two-way resolution approach, parallel vector analysis (PVA), is also developed. The idea of PVA is to construct a set of subspaces comprising only one common (spectral) component and then find a vector that is in parallel with a series of vectors coming from different subspaces. With suitably constructed subspaces the PVA procedure offers a versatile avenue to approach the unique resolution of spectral profiles. A four-component system which comprises four different processes or reactions is simulated. Results obtained reveal that favorable resolution is achieved for the spectral and concentration profiles by the suggested procedures of WFA and PVA. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A new procedure to select subsets in sequential dynamic systems is presented, based on the H‐principle of mathematical modeling, and therefore the aim is to balance improvement in fit and precision.
Abstract: A new procedure to select subsets in sequential dynamic systems is presented. Subsets of variables and samples to be included in a PLS model are considered. The approach is a combination of PLS analysis and standard regression methods. It is based on the H-principle of mathematical modeling, and therefore the aim is to balance improvement in fit and precision. One of the main aspects in the subset selection procedure is to keep the score space as large and as sensible as possible to gain a stable model. The procedure is described mathematically and demonstrated for a dynamic industrial case. The method is simple to apply and the motivation of the procedure is obvious for industrial applications. It can be used e.g. when modeling on-line systems. Copyright © 2003 John Wiley & Sons, Ltd.


Journal ArticleDOI
TL;DR: In this article, a family of models that deal with the problem of factor position shift in sequential data is proposed, and several versions of the quasi-ALS algorithm are described in detail.
Abstract: We previously proposed a family of models that deal with the problem of factor position shift in sequential data. We conjectured that the added information provided by fitting the shifts would make the model parameters identifiable, even for two-way data. We now derive methods of parameter estimation and give the results of experiments with synthetic data. The alternating least squares (ALS) approach is not fully suitable for estimation, because factor position shifts destroy the multilinearity of the latent structure. Therefore an alternative ‘quasi-ALS’ approach is developed, some of its practical and theoretical properties are dealt with and several versions of the quasi-ALS algorithm are described in detail. These procedures are quite computation-intensive, but analysis of synthetic data demonstrates that the algorithms can recover shifting latent factor structure and, in the situations tested, are robust against high error levels. The results of these experiments also provide strong empirical support for our conjecture that the two-way shifted factor model has unique solutions in at least some circumstances. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The MLPARAFAC methods are shown to produce more accurate results than PARAFAC under a variety of conditions; simplified versions of the general algorithm can be applied with greater efficiency when the noise is correlated only along one mode of the three-way array.
Abstract: Algorithms for carrying out maximum likelihood parallel factor analysis (MLPARAFAC) for three-way data are described. These algorithms are based on the principle of alternating least squares, but differ from conventional PARAFAC algorithms in that they incorporate measurement error information into the trilinear decomposition. This information is represented in the form of an error covariance matrix. Four algorithms are discussed for dealing with different error structures in the three-way array. The simplest of these treats measurements with non-uniform measurement noise which is uncorrelated. The most general algorithm can analyze data with any type of noise correlation structure. The other two algorithms are simplifications of the general algorithm which can be applied with greater efficiency to cases where the noise is correlated only along one mode of the three-way array. Simulation studies carried out under a variety of measurement error conditions were used for statistical validation of the maximum likelihood properties of the algorithms. The MLPARAFAC methods are also shown to produce more accurate results than PARAFAC under a variety of conditions. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this article, the authors generalized the quasi-ALS algorithm for shifted factor estimation to three-way and n-way models, where mode A is the only shifted sequential mode, mode B determines shifts, and modes above B simply reweight the factors.
Abstract: The ‘quasi-ALS’ algorithm for shifted factor estimation is generalized to three-way and n-way models. We consider the case in which mode A is the only shifted sequential mode, mode B determines shifts, and modes above B simply reweight the factors. The algorithm is studied using error-free and fallible synthetic data. In addition, a four-way chromatographic data set previously analyzed by Bro et al. (J. Chemometrics 1999; 13: 295–309) is reanalyzed and (two or) three out of four factors are recovered. The reason for the incomplete success may be factor shape changes combined with the lack of distinct shift patterns for two of the factors. The shifted factor model is compared with Parafac2 from both theoretical and practical points of view. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a field-portable, single-exposure excitation-emission matrix (EEM) fluorometer is used in conjunction with parallel factor analysis (PARAFAC) for sub-ppb polycyclic aromatic hydrocarbon (PAH) determinations in the presence of spectral interferents.
Abstract: A field-portable, single-exposure excitation–emission matrix (EEM) fluorometer is used in conjunction with parallel factor analysis (PARAFAC) for sub-ppb polycyclic aromatic hydrocarbon (PAH) determinations in the presence of spectral interferents. Several strategies for bringing multiway calibration methods such as PARAFAC into the field were explored. It was shown that automated methods of PARAFAC model selection can be as effective as manual selection. In addition, it was found that there is not always a single best model to employ for prediction. Second, the effect that reducing data density by systematically decreasing calibration set size and spectral resolution has on PARAFAC speed and prediction accuracy was investigated. By decreasing data density, the computational intensity of the PARAFAC algorithm can be reduced to increase the plausibility of on-the-fly data analysis. It was found that reducing eight sample PAH calibration sets to two or three calibration standards significantly decreased computation intensity yet generated adequate predictions. It was also found that spectral resolution can be decreased to reach an optimal compromise between calibration accuracy and analysis speed while minimizing instrumental requirements. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the effect of variable selection for PLS and RR on the model prediction and showed that the wavelength selection will not always give better prediction than using all of the available wavelengths.
Abstract: Standard methods for calibration of near-infrared instruments, such as partial least-squares (PLS) and ridge regression (RR), typically use the full set of wavelengths in the model. In this paper we investigate the effect of variable (wavelength) selection for these two methods on the model prediction. For RR the selection is optimized with respect to the ridge parameter, the number of variables and the configuration of the variables in the model. A fast iterative computational algorithm is developed for the purpose of this optimization. For PLS the selection is optimized with respect to the number of components, the number of variables and the configuration of the variables. We use three real data sets in this study: processed milk from the market, milk from a dairy farm and milk from the production line of a milk processing factory. The quantity of interest is the concentration of fat in the milk. The observations are randomly split into estimation and validation sets. Optimization is based on the mean square prediction error computed on the validation set. The results indicate that the wavelength selection will not always give better prediction than using all of the available wavelengths. Investigation of the information in the spectra is necessary to determine whether all of them are relevant to the objective of the model. Copyright © 2003 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this article, a novel algorithm based on coupling of the fast wavelet transform (FWT) with MLR and PLS regression techniques for the selection of optimal regression models between matrices of signals and response variables is presented: wavelet interface to linear modelling analysis (WILMA).
Abstract: A novel algorithm based on coupling of the fast wavelet transform (FWT) with MLR and PLS regression techniques for the selection of optimal regression models between matrices of signals and response variables is presented: wavelet interface to linear modelling analysis (WILMA). The algorithm decomposes each signal into the FWT domain and then, by means of proper criteria, selects the wavelet coefficients that give the best regression models, as evaluated by the leave-one-out cross-validation criterion. The predictive ability of the regression model is then checked by means of external test sets. Moreover, the signals are reconstructed back in the original domain using only the selected wavelet coefficients, to allow for chemical interpretation of the results. The algorithm was tested on different literature data sets: two near-infrared data sets from Kalivas, on which the performances of many calibration algorithms have already been tested, and a data set consisting of lead and thallium mixtures measured by differential pulse anodic stripping voltammetry and giving seriously overlapped responses. Good results were obtained for all the studied data sets; in particular, for the data sets from Kalivas the WILMA models showed improved predictive capability.
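
A heavily simplified sketch of the WILMA idea: transform each signal with the fast wavelet transform, keep a subset of informative coefficients and regress on them. The correlation-based selection rule, the PyWavelets wavelet and all parameter values are illustrative stand-ins for the paper's cross-validation-driven criteria.

```python
import numpy as np
import pywt
from sklearn.cross_decomposition import PLSRegression

def wilma_like_model(X, y, wavelet='db4', level=4, n_keep=30, n_components=5):
    """X: (n_samples, n_points) signals; y: (n_samples,) response.
    Returns a PLS model built on a subset of wavelet coefficients."""
    # Fast wavelet transform of every signal, flattened into one coefficient vector
    W = np.asarray([np.concatenate(pywt.wavedec(x, wavelet, level=level)) for x in X])
    # Illustrative relevance criterion: absolute correlation with the response
    # (the paper instead selects coefficients via leave-one-out cross-validation)
    Wc = W - W.mean(axis=0)
    yc = np.ravel(y) - np.mean(y)
    corr = np.abs(Wc.T @ yc) / (np.linalg.norm(Wc, axis=0) * np.linalg.norm(yc) + 1e-12)
    idx = np.argsort(corr)[::-1][:n_keep]
    pls = PLSRegression(n_components=n_components).fit(W[:, idx], np.ravel(y))
    return pls, idx
```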

Journal ArticleDOI
TL;DR: In this paper, a method for characterizing consecutive batch reactions using self-modeling curve resolution of in situ spectroscopic measurements and reaction energy profiles is reported, which enables simple and rapid characterization of the reaction's rate of reaction, energy balance and mass balance.
Abstract: A method for fully characterizing consecutive batch reactions using self-modeling curve resolution of in situ spectroscopic measurements and reaction energy profiles is reported. Simultaneous measurement of reaction temperature, reactor jacket temperature, reactor heater power and UV/visible spectra was made with a laboratory (50 ml capacity) batch reactor equipped with a UV/visible spectrometer and a fiber optic attenuated total reflectance (ATR) probe. Composition profiles and pure component spectra of reactants and products were estimated without the aid of reference measurements or standards from the in situ UV/visible spectra using non-negative alternating least squares (ALS), a type of self-modeling curve resolution (SMCR). Multiway SMCR analysis of consecutive batches permitted standardless comparisons of consecutive batches to determine which batch produced more or less product and which batch proceeded faster or slower. Dynamic modeling of batch energy profiles permitted mathematical resolution of the reaction dose heat and reaction heat. Kinetic fitting of the in situ reaction spectra was used to determine reaction rate constants. These three complementary approaches permitted simple and rapid characterization of the reaction's rate of reaction, energy balance and mass balance.
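
A bare-bones non-negative alternating least squares sketch in the spirit of the SMCR step described above (generic MCR-ALS with non-negativity as the only constraint); the paper's implementation, constraints and initialization are not reproduced.

```python
import numpy as np
from scipy.optimize import nnls

def mcr_als(D, C0, n_iter=50):
    """D: (time x wavelength) in situ spectra; C0: (time x components) initial
    concentration guess (e.g. from evolving factor analysis). Alternates
    non-negative least squares for spectra S and concentrations C so that
    D ~= C @ S.T with C, S >= 0."""
    C = C0.copy()
    for _ in range(n_iter):
        # Spectra with C fixed: one NNLS problem per wavelength channel
        S = np.array([nnls(C, D[:, j])[0] for j in range(D.shape[1])])
        # Concentrations with S fixed: one NNLS problem per time point
        C = np.array([nnls(S, D[i, :])[0] for i in range(D.shape[0])])
    return C, S
```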

Journal ArticleDOI
TL;DR: In this article, two novel chemometric algorithms using the wavelet transform, namely dual-domain partial least squares (DDPLS) and dualdomain principal component regression (DDPCR), are reported.
Abstract: Taking advantage of the local nature of spectral data in both the time and frequency domains, two novel chemometric algorithms using the wavelet transform, namely dual-domain partial least squares (DDPLS) and dual-domain principal component regression (DDPCR), are reported here. The proposed algorithms establish parallel, regular models to describe spectral variation in the time (wavelength) domain. They incorporate these parallel models as a way of emphasizing local features in the frequency domain. Compared with regular PLS or PCR regression models applied to a single domain, these algorithms generate more parsimonious regression models that are also more robust against unexpected variations in the prediction set. Simulation data have been used in this paper to demonstrate this improvement. The new methods have also been successfully applied to NIR spectral data sets to predict moisture, oil, protein and starch content in Cargill corn samples, as well as a set of properties in a series of Amoco hydrocarbon samples. Through their special emphasis on the local nature of spectral signals in the frequency domain, spectral variance can be separately explained over the frequency and time domains with fewer latent variables and with better predictive performance. Copyright © 2003 John Wiley & Sons, Ltd.