
Showing papers in "Journal of Chemometrics in 2004"


Journal ArticleDOI
TL;DR: In this tutorial, traditional decision tree construction and the current state of decision tree modeling are reviewed and emphasis is placed on techniques that make decision trees well suited to handle the complexities of chemical and biochemical applications.
Abstract: In this tutorial, traditional decision tree construction and the current state of decision tree modeling are reviewed. Emphasis is placed on techniques that make decision trees well suited to handle the complexities of chemical and biochemical applications. Copyright © 2004 John Wiley & Sons, Ltd.

643 citations
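
As a hedged illustration of the model family the tutorial reviews (not code from the tutorial itself), the sketch below fits a pruned decision tree with scikit-learn on synthetic data standing in for a chemical classification problem; all names and parameter values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a chemical/biochemical data set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cost-complexity pruning (ccp_alpha) is one standard way to keep the
# tree from overfitting, a central concern in tree construction.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```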


Journal ArticleDOI
TL;DR: For the data presented here, dynamic time warping with rigid slope constraints and correlation optimized warping are superior to unconstrained dynamic time warping; both considerably simplify interpretation of the factor model results.
Abstract: Two different algorithms for time-alignment as a preprocessing step in linear factor models are studied. Correlation optimized warping and dynamic time warping are both presented in the literature as methods that can eliminate shift-related artifacts from measurements by correcting a sample vector towards a reference. In this study both the theoretical properties and the practical implications of using signal warping as preprocessing for chromatographic data are investigated. The connection between the two algorithms is also discussed. The findings are illustrated by means of a case study of principal component analysis on a real data set, including manifest retention time artifacts, of extracts from coffee samples stored under different packaging conditions for varying storage times. We concluded that for the data presented here dynamic time warping with rigid slope constraints and correlation optimized warping are superior to unconstrained dynamic time warping; both considerably simplify interpretation of the factor model results. Unconstrained dynamic time warping was found to be too flexible for this chromatographic data set, resulting in an overcompensation of the observed shifts and suggesting the unsuitability of this preprocessing method for this type of signal. Copyright © 2004 John Wiley & Sons, Ltd.

604 citations
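
For readers who want to see the mechanics, here is a minimal NumPy sketch of the unconstrained dynamic time warping variant that the study found too flexible; slope constraints and correlation optimized warping restrict exactly the moves allowed in this recursion. The function name and the squared-difference cost are illustrative choices, not the authors' implementation.

```python
import numpy as np

def dtw_path(x, r):
    """Unconstrained DTW of a sample signal x onto a reference r;
    returns the accumulated cost and the warping path."""
    n, m = len(x), len(r)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - r[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    i, j, path = n, m, []
    while i > 0 and j > 0:                # backtrack the optimal alignment
        path.append((i - 1, j - 1))
        if (i, j) == (1, 1):
            break
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: D[ij])
    return D[n, m], path[::-1]
```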


Journal ArticleDOI
TL;DR: A modification of interval partial least squares (iPLS), designated backward interval PLS (biPLS), is developed and studied such that it can detect and remove the least relevant regions, thereby reducing the search domain to a size that GAs can handle easily.
Abstract: It is nowadays widely accepted that genetic algorithms (GAs) are powerful tools in variable selection and that after suitable modifications they can also be powerful in detecting the most relevant spectral regions for multivariate calibration. One of the main limitations of GAs is related to the fact that when spectral intensities are measured at a very large number of wavelengths the search domain increases correspondingly and therefore the detection of the relevant regions is much more difficult. A modification of interval partial least squares (iPLS), designated backward interval PLS (biPLS), is developed and studied such that it can detect and remove the least relevant regions, thereby reducing the search domain to a size that GAs can handle easily. In this paper the application to two different spectroscopic data sets will be shown: infrared spectroscopic analysis of polymer film additives and determination of the contents of erucic acid and total fatty acids in brassica seeds by near-infrared spectroscopy. The developed method is compared with model performances based on expert selection of variables as well as with results from application of the previously developed GA-PLS method. The sequential application of biPLS and GA-PLS has proven successful, and comparable or better results have been obtained, introducing a more automatic region selection procedure and a substantial decrease in computation time. Copyright © 2005 John Wiley & Sons, Ltd.

359 citations
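
A minimal sketch of the backward-elimination idea behind biPLS, assuming scikit-learn's PLSRegression and cross-validated MSE as the performance measure; the function name, interval definitions and stopping rule are placeholders and may differ from the paper's.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def bipls(X, y, n_intervals=20, n_keep=5, n_comp=3):
    """Backward interval PLS sketch: repeatedly drop the interval whose
    removal most improves cross-validated performance."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1, dtype=int)
    intervals = [np.arange(edges[k], edges[k + 1]) for k in range(n_intervals)]
    while len(intervals) > n_keep:
        scores = []
        for k in range(len(intervals)):
            cols = np.concatenate([iv for j, iv in enumerate(intervals) if j != k])
            pls = PLSRegression(n_components=min(n_comp, len(cols)))
            scores.append(cross_val_score(pls, X[:, cols], y, cv=5,
                                          scoring="neg_mean_squared_error").mean())
        intervals.pop(int(np.argmax(scores)))  # drop the least useful interval
    return np.concatenate(intervals)           # retained wavelength indices
```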


Journal ArticleDOI
TL;DR: In this paper, a non-negativity-constrained least squares (NNLS) algorithm for large-scale MCR and other ALS applications is presented; by rearranging the standard active set calculations, it substantially reduces the computational burden for problems with large numbers of observation vectors.
Abstract: Algorithms for multivariate image analysis and other large-scale applications of multivariate curve resolution (MCR) typically employ constrained alternating least squares (ALS) procedures in their solution. The solution to a least squares problem under general linear equality and inequality constraints can be reduced to the solution of a non-negativity-constrained least squares (NNLS) problem. Thus the efficiency of the solution to any constrained least squares problem rests heavily on the underlying NNLS algorithm. We present a new NNLS solution algorithm that is appropriate to large-scale MCR and other ALS applications. Our new algorithm rearranges the calculations in the standard active set NNLS method on the basis of combinatorial reasoning. This rearrangement serves to reduce substantially the computational burden required for NNLS problems having large numbers of observation vectors.

257 citations
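
To make the role of NNLS concrete, the sketch below runs a plain MCR-ALS loop that solves one scipy.optimize.nnls problem per row and per column of the data matrix; it is precisely this per-vector cost, repeated over many observation vectors, that the paper's combinatorial rearrangement attacks. This is a generic sketch, not the authors' algorithm.

```python
import numpy as np
from scipy.optimize import nnls

def mcr_als_nnls(D, n_comp, n_iter=50, seed=0):
    """MCR-ALS with non-negativity on both factors so that
    D (samples x channels) ~= C @ S.T."""
    rng = np.random.default_rng(seed)
    S = rng.random((D.shape[1], n_comp))              # spectral estimates
    for _ in range(n_iter):
        C = np.array([nnls(S, d)[0] for d in D])      # one NNLS per row of D
        S = np.array([nnls(C, d)[0] for d in D.T])    # one NNLS per column of D
    return C, S
```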


Journal ArticleDOI
TL;DR: The results obtained from a simulation study showed that MCCV has an obviously larger probability than leave-one-out CV (LOO-CV) of selecting the model with best prediction ability and that a corrected MCCV (CMCCV) could give a more accurate estimation of prediction ability than LOO-CV or MCCV.
Abstract: A new simple and effective method named Monte Carlo cross validation (MCCV) has been introduced and evaluated for selecting a model and estimating the prediction ability of the model selected. Unlike the leave-one-out procedure widely used in chemometrics for cross-validation (CV), the Monte Carlo cross-validation developed in this paper is an asymptotically consistent method of model selection. It can avoid an unnecessarily large model and therefore decreases the risk of overfitting of the model. The results obtained from a simulation study showed that MCCV has an obviously larger probability than leave-one-out CV (LOO-CV) of selecting the model with best prediction ability and that a corrected MCCV (CMCCV) could give a more accurate estimation of prediction ability than LOO-CV or MCCV. The results obtained with real data sets demonstrated that MCCV could successfully select an appropriate model and that CMCCV could assess the prediction ability of the selected model with satisfactory accuracy. Copyright © 2004 John Wiley & Sons, Ltd.

223 citations
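
A sketch of MCCV for choosing the number of PLS components, assuming scikit-learn's ShuffleSplit to generate the repeated random splits; the split fraction and the paper's CMCCV correction term are not reproduced here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

def mccv_select(X, y, max_comp=10, n_splits=100, test_size=0.4, seed=0):
    """Monte Carlo CV: many random splits with a large left-out
    fraction, in contrast to leave-one-out."""
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
    msep = []
    for a in range(1, max_comp + 1):
        scores = cross_val_score(PLSRegression(n_components=a), X, y,
                                 cv=cv, scoring="neg_mean_squared_error")
        msep.append(-scores.mean())
    best = int(np.argmin(msep)) + 1   # number of components with lowest MSEP
    return best, msep
```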


Journal ArticleDOI
TL;DR: In this paper, the authors compared several competing mean squared error of prediction (MSEP) estimators on principal components regression (PCR) and partial least squares regression (PLSR): leave-one-out cross-validation, K-fold and adjusted K-fold cross-validation.
Abstract: The paper presents results from simulations based on real data, comparing several competing mean squared error of prediction (MSEP) estimators on principal components regression (PCR) and partial least squares regression (PLSR): leave-one-out cross-validation, K-fold and adjusted K-fold cross-validation, the ordinary bootstrap estimate, the bootstrap smoothed cross-validation (BCV) estimate and the 0.632 bootstrap estimate. The overall performance of the estimators is compared in terms of their bias, variance and squared error. The results indicate that the 0.632 estimate and leave-one-out cross-validation are preferable when one can afford the computation. Otherwise adjusted 5- or 10-fold cross-validation are good candidates because of their computational efficiency.

222 citations
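
As one concrete instance of the compared estimators, here is a sketch of the 0.632 bootstrap estimate of MSEP for a PLSR model; the 0.368/0.632 weighting of apparent and out-of-bag error is the standard definition, while the function and data names are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def msep_632(X, y, n_comp=3, n_boot=200, seed=0):
    """0.632 bootstrap MSEP: weighted sum of the apparent (resubstitution)
    error and the average out-of-bag bootstrap error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    model = PLSRegression(n_components=n_comp).fit(X, y)
    err_app = np.mean((y - model.predict(X).ravel()) ** 2)
    err_oob = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        oob = np.setdiff1d(np.arange(n), idx)        # left-out samples
        if oob.size == 0:
            continue
        m = PLSRegression(n_components=n_comp).fit(X[idx], y[idx])
        err_oob.append(np.mean((y[oob] - m.predict(X[oob]).ravel()) ** 2))
    return 0.368 * err_app + 0.632 * np.mean(err_oob)
```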


Journal ArticleDOI
TL;DR: A new method for multivariate classification, support vector machines (SVM), was compared with that of two classical chemometric methods, partial least squares (PLS) and artificial neural networks (ANN), in classifying feed particles as either MBM or vegetal using the spectra from NIR images.
Abstract: This study concerns the development of a new system to detect meat and bone meal (MBM) in compound feeds, which will be used to enforce legislation concerning feedstuffs enacted after the European mad cow crisis. Focal plane array near-infrared (NIR) imaging spectroscopy, which collects thousands of spatially resolved spectra in a massively parallel fashion, has been suggested as a more efficient alternative to the current methods, which are tedious and require significant expert human analysis. Chemometric classification strategies have been applied to automate the method and reduce the need for constant expert analysis of the data. In this work the performance of a new method for multivariate classification, support vector machines (SVM), was compared with that of two classical chemometric methods, partial least squares (PLS) and artificial neural networks (ANN), in classifying feed particles as either MBM or vegetal using the spectra from NIR images. While all three methods were able to effectively model the data, SVM was found to perform substantially better than PLS and ANN, exhibiting a much lower rate of false positive detection. Copyright © 2004 John Wiley & Sons, Ltd.

163 citations
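
A minimal sketch of the SVM classification step, assuming scikit-learn and synthetic data in place of the per-particle NIR spectra (labels: 1 = MBM, 0 = vegetal); kernel and hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: rows would be per-particle NIR spectra.
X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
print("CV accuracy: %.3f" % cross_val_score(clf, X, y, cv=5).mean())
```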


Journal ArticleDOI
TL;DR: In this article, the authors investigated the properties of different types of ensemble methods used with PLSR in situations with highly collinear x-data and found that ensembles trained on data with added noise can make PLSR robust against the type of noise added.
Abstract: Recently, there has been increased attention in the literature on the use of ensemble methods in multivariate regression and classification. These methods have been shown to have interesting properties for both regression and classification. In particular, they can improve the accuracy of unstable predictors. Ensemble methods have so far been little studied in situations that are common for calibration and prediction in chemistry, i.e. situations with a large number of collinear x-variables and few samples. These situations are often approached by data compression methods such as principal component regression (PCR) or partial least squares regression (PLSR). The present paper is an investigation of the properties of different types of ensemble methods used with PLSR in situations with highly collinear x-data. Bagging and data augmentation by simulated noise are studied. The focus is on the robustness of the calibrations. Real and simulated data are used. The results show that ensembles trained on data with added noise can make PLSR robust against the type of noise added. In particular, the effects of sample temperature variations can be eliminated. Bagging does not seem to give any improvement over PLSR for small and intermediate numbers of components. It is, however, less sensitive to overfitting. Copyright © 2005 John Wiley & Sons, Ltd.

80 citations
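
The noise-augmentation idea can be sketched as follows: each ensemble member is a PLSR model trained on a noise-perturbed copy of X, and predictions are averaged. For the robustness effect described in the abstract, the simulated noise would have to mimic the disturbance of interest (e.g. temperature-induced spectral variation); the Gaussian noise here is only a placeholder.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def noise_ensemble_pls(X, y, n_comp=5, n_models=25, noise_sd=0.01, seed=0):
    """Data augmentation by simulated noise: train each member on a
    perturbed copy of X, then average the members' predictions."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        Xn = X + rng.normal(0.0, noise_sd, X.shape)
        models.append(PLSRegression(n_components=n_comp).fit(Xn, y))

    def predict(X_new):
        return np.mean([m.predict(X_new) for m in models], axis=0)

    return predict
```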



Journal ArticleDOI
TL;DR: In this paper, a large data set containing 735 carcinogenic activities and 1355 descriptors was used to model the structure-carcinogenic activity of drugs, and a correlation ranking and a genetic algorithm were employed for selecting the best set of principal components (PCs).
Abstract: The major problem associated with principal component regression (PCR), especially in QSAR studies, is that this model extracts the eigenvectors solely from the matrix of predictor variables and therefore they might not have an essentially good relationship with the predicted variable. This paper describes the application of PCR to model the structure–carcinogenic activity of drugs. To obtain the optimal model, correlation ranking and a genetic algorithm were employed for selecting the best set of principal components (PCs). A large data set containing 735 carcinogenic activities and 1355 descriptors was used. Two cross-validation procedures (leave-many-out and ν-fold cross-validation) and the hold-out-a-test-sample (HOTS) method were used to validate the models. It was found that introduction of PCs by the conventional eigenvalue ranking procedure did not produce the perfect model. Instead, factor selection by correlation ranking and genetic algorithm produced good models of similar quality. The models could explain more than 80% of the variances in carcinogenic activity. Copyright © 2005 John Wiley & Sons, Ltd.

67 citations
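
A sketch of the correlation-ranking part (the genetic algorithm step is omitted): principal components are entered by the strength of their correlation with the response rather than by eigenvalue. Names and the number of retained PCs are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def corr_ranked_pcr(X, y, n_keep=10):
    """PCR with PCs selected by |correlation with y| instead of the
    conventional eigenvalue ranking."""
    scores = PCA().fit_transform(X)          # PCA centers X internally
    r = np.abs([np.corrcoef(scores[:, j], y)[0, 1]
                for j in range(scores.shape[1])])
    top = np.argsort(r)[::-1][:n_keep]       # most y-correlated PCs
    return LinearRegression().fit(scores[:, top], y), top
```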


Journal ArticleDOI
Johan Trygg1
TL;DR: In this article, direct and indirect calibration have been compared with respect to both prediction and model interpretation, including their ability to estimate the pure spectral profile of each known constituent ...
Abstract: Direct and indirect calibration have been compared with respect to both prediction and model interpretation. This included their ability to estimate the pure spectral profile of each known constituent ...

Journal ArticleDOI
TL;DR: In this article, a closed-form CANDECOMP/PARAFAC solution is derived for 5 × 3 × 3 arrays when these happen to have rank 5, and any two different solutions share four of the five components.
Abstract: A key property of CANDECOMP/PARAFAC is the essential uniqueness it displays under certain conditions. It has been known for a long time that, when these conditions are not met, partial uniqueness may remain. Whereas considerable progress has been made in the study of conditions for uniqueness, the study of partial uniqueness has lagged behind. The only well known cases are those of overfactoring, when more components are extracted than are required for perfect fit, and those cases where the data do not have enough system variation, resulting in proportional components for one or more modes. The present paper deals with partial uniqueness in cases where the smallest number of components is extracted that yield perfect fit. For the case of K × K × 2 arrays of rank K, randomly sampled from a continuous distribution, it is shown that partial uniqueness, with some components unique and others differing between solutions, arises with probability zero. Also a closed-form CANDECOMP/PARAFAC solution is derived for 5 × 3 × 3 arrays when these happen to have rank 5. In such cases, any two different solutions share four of the five components. This phenomenon will be traced back to a sixth degree polynomial having six real roots, any five of which can be picked to construct a solution. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: Various genes with unknown function and ESTs, found to be important in discriminating genes for colon, leukaemia, renal and central nervous system tumour cells, are indicated as deserving high priority in future molecular studies.
Abstract: Partial least squares discriminant analysis (PLS-DA) provides a sound statistical basis for the selection, from an original 9605-data set, of a limited number of gene transcripts most effective in discriminating different tumour histotypes. The potentialities of the PLS-DA approach are pointed out by its ability to identify genes which, according to current knowledge, are associated with cancer development. Moreover, PLS-DA was able to identify MUC 13 and S100P proteins as candidates for the development of new colon cancer diagnostics. Various genes with unknown function and ESTs (expressed sequence tags), found to be important in discriminating genes for colon, leukaemia, renal and central nervous system tumour cells, are indicated as deserving high priority in future molecular studies. Copyright © 2004 John Wiley & Sons, Ltd.
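
A hedged sketch of PLS-DA-based transcript ranking with scikit-learn: the class labels are binarized, a PLS model is fitted, and variables are ranked by the size of their loading weights. This is one common importance measure; the paper's exact selection criterion may differ (VIP scores are another frequent choice).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import LabelBinarizer

def plsda_rank(X, labels, n_comp=3, n_top=50):
    """PLS-DA sketch: PLS regression on a binarized class matrix,
    then rank variables by the norm of their loading weights."""
    Y = LabelBinarizer().fit_transform(labels)     # histotypes -> 0/1 matrix
    pls = PLSRegression(n_components=n_comp).fit(X, Y)
    importance = np.linalg.norm(pls.x_weights_, axis=1)   # per variable
    return np.argsort(importance)[::-1][:n_top]
```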

Journal ArticleDOI
TL;DR: In this paper, an iterative combination of partial least squares and ordinary least squares (OLS) is proposed to analyze the relationship between a set of explanatory variables and one or more responses.
Abstract: In many situations one performs designed experiments to find the relationship between a set of explanatory variables and one or more responses. Often there are other factors that influence the results in addition to the factors that are included in the design. To obtain information about these so-called nuisance factors, one can sometimes measure them using spectroscopic methods. The question then is how to analyze this kind of data, i.e. a combination of an orthogonal design matrix and a spectroscopic matrix with hundreds of highly collinear variables. In this paper we introduce a method that is an iterative combination of partial least squares (PLS) and ordinary least squares (OLS) and compare its performance with other methods such as direct PLS, OLS and a combination of principal component analysis and least squares. The methods are compared using two real data sets and using simulated data. The results show that the incorporation of external information from spectroscopic measurements gives more information from the experiment and lower variance in the parameter estimates. We also find that the introduced algorithm separates the information from the spectral and design matrices in a nice way. It also has some advantages over PLS in showing lower bias and being less influenced by the relative weighting of the design and spectroscopic variables. Copyright © 2005 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The use of multivariate design to ensure representativity and balance of the training set data for PLS multivariate modeling and quantitative structure activity relationships in medicinal and pharmaceutical chemistry, and data mining is discussed.
Abstract: We discuss the use of multivariate design to ensure representativity and balance of the training set data for PLS multivariate modeling. Three application areas are used to illustrate the discussion ...

Journal ArticleDOI
TL;DR: In this article, different approaches for the calculation of prediction intervals of estimations obtained in multivariate curve resolution using alternating least squares optimization methods are explored and compared, including Monte Carlo simulations, noise addition and jackknife resampling.
Abstract: Different approaches for the calculation of prediction intervals of estimations obtained in multivariate curve resolution using alternating least squares optimization methods are explored and compared. These methods include Monte Carlo simulations, noise addition and jackknife resampling. Obtained results allow a preliminary investigation of noise effects and error propagation on resolved profiles and on parameters estimated from them. The effect of noise on rotational ambiguities frequently found in curve resolution methods is discussed. This preliminary study is shown for the resolution of a three-component equilibrium system with overlapping concentration and spectral profiles. Copyright © 2004 John Wiley & Sons, Ltd.
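
The noise-addition approach can be sketched generically: perturb the data matrix many times, re-run the resolution, and take percentile intervals of the resolved parameters. The resolve callable is a user-supplied stand-in for the ALS resolution step, and the sketch ignores the component-matching and rotational-ambiguity issues the paper shows to be important.

```python
import numpy as np

def noise_addition_intervals(D, resolve, noise_sd, n_rep=200,
                             q=(2.5, 97.5), seed=0):
    """Percentile intervals by noise addition. `resolve` must map a
    data matrix to a parameter array of fixed shape (e.g. resolved
    concentration profiles)."""
    rng = np.random.default_rng(seed)
    reps = np.array([resolve(D + rng.normal(0.0, noise_sd, D.shape))
                     for _ in range(n_rep)])
    return np.percentile(reps, q, axis=0)   # lower/upper interval bounds
```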

Journal ArticleDOI
TL;DR: In this article, the orthogonal projections to latent structures (O-PLS) method was used for preprocessing and filtering of near-infrared reflectance spectra of soil samples.
Abstract: The objective of this paper is to present new properties of the orthogonal projections to latent structures (O-PLS) method developed by Trygg and Wold (J. Chemometrics 2002; 16: 119–128). The original orthogonal signal correction (OSC) filter of Wold et al. (Chemometrics Intell. Lab. Syst. 1998; 44: 175–185) removes systematic variation from X that is unrelated to Y. O-PLS is a more restrictive OSC filter. O-PLS removes only systematic variation in X explained in each PLS component that is not correlated with Y. O-PLS is a slight modification of the NIPALS PLS algorithm, which should make O-PLS a generally applicable preprocessing and filtering method. The computation of the O-PLS components under the constraint of being correlated with one PLS component imposes particular properties on the space spanned by the O-PLS components. This paper is divided into two main sections. First we give an application of O-PLS on near-infrared reflectance spectra of soil samples, showing some graphical properties. Then we give the mathematical justifications of these properties. Copyright © 2004 John Wiley & Sons, Ltd.
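
A compact NumPy sketch of the single-y O-PLS filter following the published algorithm of Trygg and Wold: each orthogonal component is built from the part of the loading p that is orthogonal to the Y-predictive weight w, then deflated from X. Variable names follow the usual O-PLS notation; centering is assumed and error handling is omitted.

```python
import numpy as np

def opls_filter(X, y, n_orth=1):
    """Remove n_orth Y-orthogonal components from X (single y)."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    w = X.T @ y
    w /= np.linalg.norm(w)                 # Y-predictive weight
    T_o, P_o = [], []
    for _ in range(n_orth):
        t = X @ w
        p = X.T @ t / (t @ t)
        w_o = p - (w @ p) * w              # part of p orthogonal to w
        w_o /= np.linalg.norm(w_o)
        t_o = X @ w_o
        p_o = X.T @ t_o / (t_o @ t_o)
        X = X - np.outer(t_o, p_o)         # deflate Y-orthogonal variation
        T_o.append(t_o)
        P_o.append(p_o)
    return X, np.column_stack(T_o), np.column_stack(P_o)
```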

Journal ArticleDOI
TL;DR: In this article, a bootstrap procedure is proposed that produces percentile intervals for all output parameters, which indicate the instability of the sample solutions and can be interpreted as confidence intervals for the output parameters.
Abstract: Results from exploratory three-way analysis techniques such as CANDECOMP/PARAFAC and Tucker3 analysis are usually presented without giving insight into uncertainties due to sampling. Here a bootstrap procedure is proposed that produces percentile intervals for all output parameters. Special adjustments are offered for handling the non-uniqueness of the solutions. The percentile intervals indicate the instability of the sample solutions. By means of a simulation study it is demonstrated that the percentile intervals can fairly well be interpreted as confidence intervals for the output parameters. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, a set of principal properties (PPs) for coded amino acids was derived from GRID maps and experimental data, and the new parameters are further used to develop modified auto- and cross-covariance transforms which appear to be even more suitable for the stated goals, as they label each peptide according to its bonding capabilities.
Abstract: The paper derives a new set of principal properties (PPs) for coded amino acids from GRID maps and experimental data. The three scales characterize side chains according to their polarity (PP1), size/hydrophobicity (PP2) and H-bonding capability (PP3) and can be used profitably both for describing and designing peptide series. The new parameters are further used to develop modified auto- and cross-covariance transforms which appear to be even more suitable for the stated goals, as they label each peptide according to its bonding capabilities. Copyright © 2004 John Wiley & Sons, Ltd.
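
A rough sketch of an auto- and cross-covariance (ACC) transform: a peptide described residue-by-residue on the three PP scales becomes a fixed-length vector of lagged covariances between scales. The normalization and maximum lag are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def acc_transform(Z, max_lag=3):
    """Z: (n_residues x n_scales) descriptor matrix for one peptide.
    Returns a fixed-length vector of lagged auto/cross-covariances,
    so peptides of different lengths become comparable."""
    n, _ = Z.shape
    Z = Z - Z.mean(axis=0)
    feats = []
    for lag in range(1, max_lag + 1):
        C = Z[:-lag].T @ Z[lag:] / (n - lag)   # scale-by-scale covariances
        feats.append(C.ravel())
    return np.concatenate(feats)
```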

Journal ArticleDOI
TL;DR: A newly developed modification of multivariate statistical process control charts based on on‐line NIR allows easy and efficient identification of abnormal fermentation runs, even at an early stage of the fermentation, which is critical for industrial production monitoring.
Abstract: On-line near infrared (NIR) spectroscopy enables detailed monitoring of many important types of industrial fermentation processes to document progress and to ensure comparable end-product quality. In this work a newly developed modification of multivariate statistical process control (MSPC) charts based on on-line NIR allows easy and efficient identification of abnormal fermentation runs, even at an early stage of the fermentation, which is critical for industrial production monitoring. This study especially focuses on alignment of cultivations through usage of three different time concepts: absolute time, relative time and ‘biological process time’. For the investigated batch fermentation the absolute time concept for alignment is not acceptable. Using the relative time concept or the ‘biological process time’ concept gave similar possibilities for early warning of abnormal batch runs for completed runs. For on-line use, however, the relative time concept cannot be used, which is the reason for our development and introduction of a new ‘biological process time’ concept. Copyright © 2004 John Wiley & Sons, Ltd.
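
For orientation, here is a generic Hotelling T² chart built from a PCA model of normal reference batches; the paper's charts are a modification of this standard MSPC construction, and its time-alignment step (relative or 'biological process time') happens before this point.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

def t2_chart(X_ref, X_new, n_comp=2, alpha=0.01):
    """Hotelling T^2 values for new observations against a PCA model
    of normal operation, plus the F-based control limit."""
    pca = PCA(n_components=n_comp).fit(X_ref)
    s2 = pca.transform(X_ref).var(axis=0, ddof=1)    # score variances
    T2 = np.sum(pca.transform(X_new) ** 2 / s2, axis=1)
    n = X_ref.shape[0]
    limit = (n_comp * (n - 1) * (n + 1)) / (n * (n - n_comp)) * \
            stats.f.ppf(1 - alpha, n_comp, n - n_comp)
    return T2, limit                                 # flag T2 > limit
```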

Journal ArticleDOI
TL;DR: In this article, a new, efficient algorithm for the Jacobian matrix of partial least squares regression is introduced, making the calculation of prediction intervals via a local linearization of the PLS estimator more practicable.
Abstract: Several algorithms to calculate the vector of regression coefficients and the Jacobian matrix for partial least squares regression have been published. Whereas many efficient algorithms to calculate the regression coefficients exist, algorithms to calculate the Jacobian matrix are inefficient. Here we introduce a new, efficient algorithm for the Jacobian matrix, thus making the calculation of prediction intervals via a local linearization of the PLS estimator more practicable. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, alternative multivariate linear regression methods conceived to take into account data uncertainties are evaluated under simulation scenarios that cover different noise and data structures, and the results thus obtained provide guidelines on which methods to use and when.
Abstract: With the development of measurement instrumentation methods and metrology, one is very often able to rigorously specify the uncertainty associated with each measured value (e.g. concentrations, spectra, process sensors). The use of this information, along with the corresponding raw measurements, should, in principle, lead to more sound ways of performing data analysis, since the quality of data can be explicitly taken into account. This should be true, in particular, when noise is heteroscedastic and of a large magnitude. In this paper we focus on alternative multivariate linear regression methods conceived to take into account data uncertainties. We critically investigate their prediction and parameter estimation capabilities and suggest some modifications of well-established approaches. All alternatives are tested under simulation scenarios that cover different noise and data structures. The results thus obtained provide guidelines on which methods to use and when. Interestingly enough, some of the methods that explicitly incorporate uncertainty information in their formulations tend to present not as good performances in the examples studied, whereas others that do not do so present an overall good performance. Copyright © 2005 John Wiley & Sons, Ltd.
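
The simplest member of this family of methods is weighted least squares with known per-sample response variances, sketched below; the multivariate methods compared in the paper generalize this idea to uncertainties in all measured values.

```python
import numpy as np

def wls(X, y, var_y):
    """Weighted least squares: samples with smaller stated uncertainty
    get larger weight. Solves (X' W X) b = X' W y with W = diag(1/var)."""
    w = 1.0 / np.asarray(var_y)
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```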

Journal ArticleDOI
TL;DR: In this article, the L-curve harmonious approach is proposed to find the optimal ridge parameter value, where λ ≥ 0, by inspecting the bias/variance tradeoff.
Abstract: A critical component of ridge regression (RR) is determining the optimal ridge parameter value, λ, where λ ≥ 0. Improper selection of λ not only generates an under- or overfitted model but also leads to incorrect conclusions in inter-model comparison studies such as between RR, PLS, PCR and other modeling methods. Several methods for determining the optimal RR model are evaluated in this paper. For example, the commonly used ridge trace is identified as subjective and impractical. A direct calculation method from the literature yields over- or underfitted RR models with λ either too small or too large, respectively. Methods for determining λ based on a harmonious approach are discussed. The harmonious approach optimizes λ by inspecting the bias/variance tradeoff. Of the methods investigated, plotting a variance indicator against a bias measure to yield an L-curve appears not only to simplify selection of λ but also to reduce the chance of obtaining an under- or overfitted RR model. It is shown with four data sets that the L-curve harmonious approach consistently provides good models. The effective rank of models is also discussed in conjunction with the harmony/parsimony tradeoff. Copyright © 2005 John Wiley & Sons, Ltd.
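
A sketch of computing L-curve points for ridge regression via the SVD: for each λ, a residual norm (bias-related) is paired with the solution norm (variance-related), and the corner of the resulting curve suggests λ. This is the generic L-curve construction; the paper's specific variance and bias indicators may differ.

```python
import numpy as np

def l_curve(X, y, lambdas):
    """Return (residual norm, solution norm) pairs for each ridge
    parameter, computed efficiently from one SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    uty = U.T @ y
    pts = []
    for lam in lambdas:
        f = s / (s ** 2 + lam)            # filtered inverse singular values
        b = Vt.T @ (f * uty)              # ridge solution for this lambda
        pts.append((np.linalg.norm(X @ b - y), np.linalg.norm(b)))
    return np.array(pts)
```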

Journal ArticleDOI
TL;DR: In this paper, the authors present a general approach based on an experimental design in factors that are possible to control, a set up of raw material in blocks in combination with multivariate measurements (using FT-IR) of the raw material.
Abstract: To be able to control an industrial process, it is necessary to know the relationship between raw materials, process settings and end-product results. In many situations the raw materials are highly complex and difficult to vary in a systematic way. This makes the use of standard experimental design techniques with a systematic variation in the variables difficult. To solve this, one might measure the raw materials at hand, but then the problem is to know what to measure. In this paper we present a general approach for such situations based on an experimental design in factors that are possible to control, a set up of raw material in blocks in combination with multivariate measurements (using FT-IR) of the raw material. To analyse the results, we include these measurements as principal components of the spectra. The usefulness of the approach is demonstrated with an example from cheese production. It is shown that it is possible to obtain a model for the amount of ‘cheese fines’ (a yield loss parameter) based on this approach. The final model contains readily measurable information about the raw materials but is obtained without any prior hypothesis about their contribution. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: The onion design is used to select an informative training set to be used in conjunction with QSAR, combinatorial technologies and other areas of research depending on optimized molecular properties.
Abstract: Statistical molecular design (SMD) is an efficient tool for selecting informative, representative and diverse sets of molecular structures to be used in conjunction with QSAR, combinatorial technologies and other areas of research depending on optimization of molecular properties. Onion design represents a recent addition to the plethora of designs encountered in the SMD toolbox. It is a flexible design approach relying on a combination of the best properties of other design families, notably the model support property of D-optimal design and the uniform coverage ability of space-filling design. The onion design splits the candidate set into a number of subsets (‘shells’ or ‘layers’), and a D-optimal selection is made from each shell. This makes it possible to select representative sets of molecular structures throughout any property space with reasonable design sizes. The number of selected molecules is easily controlled by varying (i) the number of shells and (ii) the model on which the design is based. The applicability of onion design to a pharmaceutical QSAR problem is reported. The example data set contains 967 drug-like molecules. The biological activity under investigation is the inhibition of the major human drug-metabolizing enzyme cytochrome P450 3A4. Onion design is used to select an informative training set. QSAR modeling is accomplished by means of multivariate data analysis tools. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: In this paper, the authors present measures of effective rank for a given model that can be applied to all modeling methods, thereby providing inter-model comparisons and allowing the true nature of variable selection for improved parsimony to be properly assessed.
Abstract: In order to determine the proper multivariate calibration model, it is necessary to select the number of respective basis vectors (latent vectors, factors, etc.) when using principal component regression (PCR) or partial least squares (PLS). These values are commonly referred to as the prediction rank of the model. Comparisons between PCR and PLS models for a given data set are often made with the prediction rank to determine the more parsimonious model, ignoring the fact that the values have been obtained using different basis sets. Additionally, it is not possible to use this approach for determining the prediction rank of models generated by other modeling methods such as ridge regression (RR). This paper presents measures of effective rank for a given model that can be applied to all modeling methods, thereby providing inter-model comparisons. A definition based on the regression vector norm is presented and compared with two alternative forms from the literature. With a proper definition of effective rank, a better assessment of degrees of freedom for statistical computations is possible. Additionally, the true nature of variable selection for improved parsimony can be properly assessed. Spectroscopic data sets are used as examples with PCR, PLS and RR. Copyright © 2004 John Wiley & Sons, Ltd.
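
As a stand-in illustration, one widely used measure of effective rank for ridge regression is the trace of the hat matrix, computable from the singular values of X; note that the paper's own definition is based on the regression vector norm, which is not reproduced here.

```python
import numpy as np

def ridge_effective_rank(X, lam):
    """Effective degrees of freedom of ridge regression:
    trace of the hat matrix = sum_i s_i^2 / (s_i^2 + lambda)."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s ** 2 / (s ** 2 + lam)))
```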

Journal ArticleDOI
TL;DR: In this article, a variable selection method for classification problems where the X data are discretely sampled from continuous curves is presented, where the loading weight vectors of a PLS discriminant analysis inherit the continuous behaviour.
Abstract: In this paper we present a new variable selection method designed for classification problems where the X data are discretely sampled from continuous curves. For such data the loading weight vectors of a PLS discriminant analysis inherit the continuous behaviour, making the idea of local peaks meaningful. For successive components the local peaks are checked for importance before entering the set of selected variables. Our examples with NIR/NIT show that substantial simplification of the X space can be obtained without loss of classification power when compared with ‘benchmark full-spectrum’ methods. Copyright © 2004 John Wiley & Sons, Ltd.
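
A sketch of the peak-picking idea with SciPy and scikit-learn: for each PLS-DA component, the wavelengths at local peaks of the absolute loading weights enter the selected set. The prominence threshold is an illustrative stand-in for the paper's importance check, and the class labels are assumed to be coded 0/1.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.cross_decomposition import PLSRegression

def peak_select(X, y01, n_comp=3, prominence=0.02):
    """Keep wavelengths at local peaks of |loading weight| for each
    PLS-DA component (y01: numeric 0/1 class labels)."""
    pls = PLSRegression(n_components=n_comp).fit(X, y01)
    selected = set()
    for a in range(n_comp):
        w = np.abs(pls.x_weights_[:, a])
        peaks, _ = find_peaks(w, prominence=prominence * w.max())
        selected.update(peaks.tolist())
    return np.sort(np.fromiter(selected, dtype=int))
```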

Journal ArticleDOI
TL;DR: This paper describes a procedure where an SVM classifier is constructed with support vectors systematically retrieved from the pool of unlabelled samples; it provides perfect classification without further labelling, giving the same outcome as most classical techniques.
Abstract: Labelling samples is a procedure that may result in significant delays particularly when dealing with larger datasets and/or when labelling implies prolonged analysis. In such cases a strategy that allows the construction of a reliable classifier on the basis of a minimal sized training set by labelling a minor fraction of samples can be of advantage. Support vector machines (SVMs) are ideal for such an approach because the classifier relies on only a small subset of samples, namely the support vectors, while being independent from the remaining ones that typically form the majority of the dataset. This paper describes a procedure where an SVM classifier is constructed with support vectors systematically retrieved from the pool of unlabelled samples. The procedure is termed ‘active’ because the algorithm interacts with the samples prior to their labelling rather than waiting passively for the input. The learning behaviour on simulated datasets is analysed and a practical application for the detection of hydrocarbons in soils using mass spectrometry is described. Results on simulations show that the active learning SVM performs optimally on datasets where the classes display an intermediate level of separation. On the real case study the classifier correctly assesses the membership of all samples in the original dataset while requiring labelling of only around 14% of the data. Its subsequent application on a second dataset of analogous nature also provides perfect classification without further labelling, giving the same outcome as most classical techniques based on the entirely labelled original dataset. Copyright © 2004 John Wiley & Sons, Ltd.
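
A minimal active-learning loop for a binary SVM: at each step the still-unlabelled sample closest to the current decision boundary is sent for labelling. The oracle callable stands in for the analyst performing the labelling, the seed set must contain both classes, and all other names are placeholders rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def active_svm(X, oracle, seed_idx, n_queries=20):
    """Query-by-boundary active learning for a binary SVM.
    oracle(i) returns the true label of sample i on request."""
    labelled = list(seed_idx)              # must already cover both classes
    y = {i: oracle(i) for i in labelled}
    clf = SVC(kernel="rbf", gamma="scale")
    for _ in range(n_queries):
        clf.fit(X[labelled], [y[i] for i in labelled])
        rest = [i for i in range(len(X)) if i not in y]
        if not rest:
            break
        d = np.abs(clf.decision_function(X[rest]))   # distance to boundary
        pick = rest[int(np.argmin(d))]               # most ambiguous sample
        y[pick] = oracle(pick)
        labelled.append(pick)
    return clf, labelled
```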

Journal ArticleDOI
TL;DR: In this article, the authors give an overview of simplicity transformations and typical rank results, which are suitable to determine whether or not certain constrained core arrays are trivial in terms of their complexity.
Abstract: In chemometric applications of Tucker three-way principal component analysis, core arrays are often constrained to have a large majority of zero elements. This gives rise to questions of non-triviality (are the constraints active, or can any core of a given format be transformed to satisfy the constraints?) and uniqueness (can we transform the components in one or more directions without losing the given pattern of zero elements in the core?). Rather than deciding such questions on an ad hoc basis, general principles are to be preferred. This paper gives an overview of simplicity transformations on the one hand, and typical rank results on the other, which are suitable to determine whether or not certain constrained cores are trivial. Copyright © 2004 John Wiley & Sons, Ltd.

Journal ArticleDOI
TL;DR: A first-order perturbation analysis of the least squares approximation of a given higher-order tensor by a tensor having prespecified n-mode ranks is performed.
Abstract: In this paper we perform a first-order perturbation analysis of the least squares approximation of a given higher-order tensor by a tensor having prespecified n-mode ranks. This work generalizes the classical first-order perturbation analysis of the matrix singular value decomposition. We will show that there are important differences between the matrix and the higher-order tensor case. We subsequently address (1) the best rank-1 approximation of supersymmetric tensors, (2) the best rank-(R1, R2, R3) approximation of arbitrary tensors and (3) the best rank-(R1, R2, R3) approximation of supersymmetric tensors.
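
For concreteness, the rank-(R1, R2, R3) approximation that the analysis concerns can be computed, suboptimally but simply, by truncated HOSVD, which is also the usual starting point for the iterations that find the least squares optimum; a minimal NumPy sketch:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: put axis `mode` first, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_mult(T, M, mode):
    """Multiply tensor T by matrix M along axis `mode`."""
    moved = np.moveaxis(T, mode, 0)
    return np.moveaxis(np.tensordot(M, moved, axes=1), 0, mode)

def truncated_hosvd(T, ranks):
    """Rank-(R1, ..., RN) approximation by truncated HOSVD: good but
    not least squares optimal; HOOI-type iterations refine it."""
    U = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        U.append(u[:, :r])                     # dominant mode-n subspace
    core = T
    for mode, u in enumerate(U):
        core = mode_mult(core, u.T, mode)      # project onto subspaces
    approx = core
    for mode, u in enumerate(U):
        approx = mode_mult(approx, u, mode)    # map back to full space
    return approx, core, U
```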