Open Access, Journal Article

Tumor classification by partial least squares using microarray gene expression data

Danh V. Nguyen, +1 more
01 Jan 2002, Vol. 18, Iss. 1, pp. 39-50
TL;DR: The paper proposes a procedure for classifying (predicting) human tumor samples from microarray gene expression data, in which Partial Least Squares (PLS) is used for dimension reduction. Under many circumstances PLS proves superior to the well-known dimension reduction method of Principal Components Analysis (PCA).
Abstract
Motivation: One important application of gene expression microarray data is the classification of samples into categories, such as the type of tumor. Microarrays allow the expression of thousands of genes to be monitored simultaneously in each sample. This ability to measure gene expression en masse has produced data in which the number of variables p (genes) far exceeds the number of samples N. Standard statistical methodologies for classification and prediction work poorly, or not at all, when N < p. The analysis of microarray data therefore requires either modifying existing statistical methodologies or developing new ones.

Results: We propose a novel analysis procedure for classifying (predicting) human tumor samples based on microarray gene expressions. This procedure involves dimension reduction using Partial Least Squares (PLS) and classification using Logistic Discrimination (LD) and Quadratic Discriminant Analysis (QDA). We compare PLS to the well-known dimension reduction method of Principal Components Analysis (PCA). Under many circumstances PLS proves superior; we illustrate a condition under which PCA predicts particularly poorly relative to PLS. The proposed methods were applied to five different microarray data sets involving various human tumor samples: (1) normal versus ovarian tumor; (2) Acute Myeloid Leukemia (AML) versus Acute Lymphoblastic Leukemia (ALL); (3) Diffuse Large B-cell Lymphoma (DLBCLL) versus B-cell Chronic Lymphocytic Leukemia (BCLL); (4) normal versus colon tumor; and (5) Non-Small-Cell Lung Carcinoma (NSCLC) versus renal samples. The stability of the classification results and methods was further assessed by re-randomization studies.

Availability: The methodology can be implemented using a combination of standard statistical methods, available, for example, in SAS. Illustrative SAS code is available from the first author.
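The contrast between PLS and PCA described in the abstract can be sketched on a toy example. The following is a minimal illustration, not the authors' SAS procedure: the data are constructed by hand, all function names are hypothetical, only the first component of each method is used, and a nearest-class-mean rule stands in for the LD and QDA classifiers named in the abstract. The data are built so that the highest-variance genes are unrelated to the class label, which is one plausible reading of a condition under which PCA would predict poorly; the abstract does not specify the condition the authors examined.

```python
# Toy comparison of the first PLS component and the first principal component.
# Data, names, and the nearest-mean classifier are illustrative assumptions.

def center(X):
    """Subtract each column (gene) mean from the data matrix."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = sum(a * a for a in v) ** 0.5
    return [a / norm for a in v]

def scores(Xc, w):
    """Project every sample onto the direction w."""
    return [sum(a * b for a, b in zip(row, w)) for row in Xc]

def pls_direction(Xc, y):
    """First PLS weight vector, proportional to X^T (y - mean(y))."""
    ybar = sum(y) / len(y)
    p = len(Xc[0])
    return normalize([sum(row[j] * (yi - ybar) for row, yi in zip(Xc, y))
                      for j in range(p)])

def pca_direction(Xc, iters=100):
    """First principal axis, found by power iteration on X^T X."""
    v = normalize([1.0] * len(Xc[0]))
    for _ in range(iters):
        t = scores(Xc, v)
        v = normalize([sum(row[j] * ti for row, ti in zip(Xc, t))
                       for j in range(len(v))])
    return v

def nearest_mean_accuracy(t, y):
    """In-sample accuracy of assigning each score to the closer class mean."""
    g0 = [ti for ti, yi in zip(t, y) if yi == 0]
    g1 = [ti for ti, yi in zip(t, y) if yi == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    pred = [1 if abs(ti - m1) < abs(ti - m0) else 0 for ti in t]
    return sum(pi == yi for pi, yi in zip(pred, y)) / len(y)

# N = 8 samples, p = 12 genes (N < p).
y = [0, 0, 0, 0, 1, 1, 1, 1]
signal = [-1, -1, -1, -1, 1, 1, 1, 1]    # low variance, tracks the class
nuisance = [3, -3, 3, -3, 3, -3, 3, -3]  # high variance, unrelated to class
X = [[float(s)] * 4 + [float(z)] * 8 for s, z in zip(signal, nuisance)]

Xc = center(X)
pls_acc = nearest_mean_accuracy(scores(Xc, pls_direction(Xc, y)), y)
pca_acc = nearest_mean_accuracy(scores(Xc, pca_direction(Xc)), y)
print("PLS component accuracy:", pls_acc)
print("PCA component accuracy:", pca_acc)
```

On this constructed example the PLS component separates the two classes, because its direction is chosen using the labels, whereas the first principal component follows the high-variance nuisance genes and carries no class information. The accuracies are computed in-sample on noise-free data, so they say nothing about real predictive performance, which the paper assesses on actual tumor data sets and by re-randomization.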

