
Showing papers by "Robert Tibshirani" published in 2004


Journal ArticleDOI
TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

7,828 citations
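
The Forward Stagewise procedure named in the LARS abstract above can be illustrated with a minimal Python sketch. This is the incremental small-step variant, not the paper's LARS algorithm or its publicly available implementation; the function name `forward_stagewise`, the step size `eps`, and the step count `n_steps` are assumptions chosen for illustration.

```python
import numpy as np


def forward_stagewise(X, y, eps=0.01, n_steps=5000):
    """Incremental forward stagewise regression (illustrative sketch).

    Columns of X are standardized and y is centered, so the returned
    coefficients refer to the standardized predictors.
    """
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    r = y - y.mean()                      # current residual
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ r                    # inner product of each predictor with the residual
        j = int(np.argmax(np.abs(corr)))  # predictor most correlated with the residual
        if abs(corr[j]) < 1e-8:           # nothing left to explain
            break
        delta = eps * np.sign(corr[j])    # tiny step in the direction of the correlation
        beta[j] += delta
        r = r - delta * X[:, j]
    return beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
    print(np.round(forward_stagewise(X, y), 2))
```

Stopping the loop early (a smaller `n_steps`) yields sparser, more heavily regularized fits; run long enough, the coefficients settle within roughly `eps` of the least squares solution.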


Journal ArticleDOI
TL;DR: It is suggested that prostate tumors can be usefully classified according to their gene expression patterns, and these tumor subtypes may provide a basis for improved prognostication and treatment stratification.
Abstract: Prostate cancer, a leading cause of cancer death, displays a broad range of clinical behavior from relatively indolent to aggressive metastatic disease. To explore potential molecular variation underlying this clinical heterogeneity, we profiled gene expression in 62 primary prostate tumors, as well as 41 normal prostate specimens and nine lymph node metastases, using cDNA microarrays containing ≈26,000 genes. Unsupervised hierarchical clustering readily distinguished tumors from normal samples, and further identified three subclasses of prostate tumors based on distinct patterns of gene expression. High-grade and advanced stage tumors, as well as tumors associated with recurrence, were disproportionately represented among two of the three subtypes, one of which also included most lymph node metastases. To further characterize the clinical relevance of tumor subtypes, we evaluated as surrogate markers two genes differentially expressed among tumor subgroups by using immunohistochemistry on tissue microarrays representing an independent set of 225 prostate tumors. Positive staining for MUC1, a gene highly expressed in the subgroups with “aggressive” clinicopathological features, was associated with an elevated risk of recurrence (P = 0.003), whereas strong staining for AZGP1, a gene highly expressed in the other subgroup, was associated with a decreased risk of recurrence (P = 0.0008). In multivariate analysis, MUC1 and AZGP1 staining were strong predictors of tumor recurrence independent of tumor grade, stage, and preoperative prostate-specific antigen levels. Our results suggest that prostate tumors can be usefully classified according to their gene expression patterns, and these tumor subtypes may provide a basis for improved prognostication and treatment stratification.

1,315 citations


Journal ArticleDOI
TL;DR: The use of gene-expression profiling improves the molecular classification of adult AML and identifies new molecular subtypes of AML, including two prognostically relevant subgroups in AML with a normal karyotype.
Abstract: Background: In patients with acute myeloid leukemia (AML), the presence or absence of recurrent cytogenetic aberrations is used to identify the appropriate therapy. However, the current classification system does not fully reflect the molecular heterogeneity of the disease, and treatment stratification is difficult, especially for patients with intermediate-risk AML with a normal karyotype. Methods: We used complementary-DNA microarrays to determine the levels of gene expression in peripheral-blood samples or bone marrow samples from 116 adults with AML (including 45 with a normal karyotype). We used unsupervised hierarchical clustering analysis to identify molecular subgroups with distinct gene-expression signatures. Using a training set of samples from 59 patients, we applied a novel supervised learning algorithm to devise a gene-expression–based clinical-outcome predictor, which we then tested using an independent validation group comprising the 57 remaining patients. Results: Unsupervised analysis identi

992 citations


Journal ArticleDOI
TL;DR: Measurement of the expression of six genes is sufficient to predict overall survival in diffuse large-B-cell lymphoma.
Abstract: Background: Several gene-expression signatures can be used to predict the prognosis in diffuse large-B-cell lymphoma, but the lack of practical tests for a genome-scale analysis has restricted the use of this method. Methods: We studied 36 genes whose expression had been reported to predict survival in diffuse large-B-cell lymphoma. We measured the expression of each of these genes in independent samples of lymphoma from 66 patients by quantitative real-time polymerase-chain-reaction analyses and related the results to overall survival. Results: In a univariate analysis, genes were ranked on the basis of their ability to predict survival. The genes that were the strongest predictors were LMO2, BCL6, FN1, CCND2, SCYA3, and BCL2. We developed a multivariate model that was based on the expression of these six genes, and we validated the model in two independent microarray data sets. The model was independent of the International Prognostic Index and added to its predictive power. Conclusions: Measurement of the expression of six genes is sufficient to predict overall survival in diffuse large-B-cell lymphoma.

891 citations


Journal Article
TL;DR: An algorithm is derived that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model.
Abstract: The support vector machine (SVM) is a widely used tool for classification. Many efficient implementations exist for fitting a two-class SVM model. The user has to supply values for the tuning parameters: the regularization cost parameter, and the kernel parameters. It seems a common practice is to use a default value for the cost parameter, often leading to the least restrictive model. In this paper we argue that the choice of the cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. We illustrate our algorithm on some examples, and use our representation to give further insight into the range of SVM solutions.

699 citations


Journal ArticleDOI
TL;DR: Diagnostic procedures are presented that accurately predict the survival of future patients based on the gene expression profiles and survival times of previous patients; the procedures were successfully applied to several publicly available datasets.
Abstract: An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.

678 citations


Journal ArticleDOI
TL;DR: Least Angle Regression (LARS) is a new model selection algorithm, a useful and less greedy version of traditional forward selection methods.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method;

547 citations


Journal ArticleDOI
TL;DR: Over half of the ILCs differ from IDCs not only in histological and clinical features but also in global transcription programs, and the remaining ILCs closely resemble IDCs in their transcription patterns.
Abstract: Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological types of breast cancer worldwide. Whereas IDC incidence has remained stable, ILC is the most rapidly increasing breast cancer phenotype in the United States and Western Europe. It is not clear whether IDC and ILC represent molecularly distinct entities and what genes might be involved in the development of these two phenotypes. We conducted comprehensive gene expression profiling studies to address these questions. Total RNA from 21 ILCs, 38 IDCs, two lymph node metastases, and three normal tissues were amplified and hybridized to approximately 42,000 clone cDNA microarrays. Data were analyzed using hierarchical clustering algorithms and statistical analyses that identify differentially expressed genes (significance analysis of microarrays) and minimal subsets of genes (prediction analysis for microarrays) that succinctly distinguish ILCs and IDCs. Eleven of 21 (52%) of the ILCs ("typical" ILCs) clustered together and displayed different gene expression profiles from IDCs, whereas the other ILCs ("ductal-like" ILCs) were distributed between different IDC subtypes. Many of the differentially expressed genes between ILCs and IDCs code for proteins involved in cell adhesion/motility, lipid/fatty acid transport and metabolism, immune/defense response, and electron transport. Many genes that distinguish typical and ductal-like ILCs are involved in regulation of cell growth and immune response. Our data strongly suggest that over half the ILCs differ from IDCs not only in histological and clinical features but also in global transcription programs. The remaining ILCs closely resemble IDCs in their transcription patterns. Further studies are needed to explore the differences between ILC molecular subtypes and to determine whether they require different therapeutic strategies.

418 citations


Journal ArticleDOI
TL;DR: The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data and performs as well or better than several methods that require the full spectra, rather than just labelled peaks.
Abstract: Motivation: Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Results: Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data. Supplementary Information: http://www.stat.stanford.edu/~tibs/ppc

218 citations


Journal ArticleDOI
TL;DR: Transcriptional responses in 24 genes predicted radiation toxicity in 9 of 14 patients with no false positives among 43 controls; the responses of these nine patients displayed significant heterogeneity, and the results may enable physicians to predict toxicity and tailor treatment for individual patients.
Abstract: Toxicity from radiation therapy is a grave problem for cancer patients. We hypothesized that some cases of toxicity are associated with abnormal transcriptional responses to radiation. We used microarrays to measure responses to ionizing and UV radiation in lymphoblastoid cells derived from 14 patients with acute radiation toxicity. The analysis used heterogeneity-associated transformation of the data to account for a clinical outcome arising from more than one underlying cause. To compute the risk of toxicity for each patient, we applied nearest shrunken centroids, a method that identifies and cross-validates predictive genes. Transcriptional responses in 24 genes predicted radiation toxicity in 9 of 14 patients with no false positives among 43 controls (P = 2.2 × 10⁻⁷). The responses of these nine patients displayed significant heterogeneity. Of the five patients with toxicity and normal responses, two were treated with protocols that proved to be highly toxic. These results may enable physicians to predict toxicity and tailor treatment for individual patients.

132 citations


01 Jan 2004
TL;DR: These SCRDA methods generalize the nearest shrunken centroids idea of Prediction Analysis of Microarray (PAM) to classical discriminant analysis; they perform uniformly well in multivariate classification problems and, in particular, outperform the currently popular PAM.
Abstract: In this paper, we introduce a family of modified versions of linear discriminant analysis, called “shrunken centroids regularized discriminant analysis” (SCRDA). These methods generalize the idea of the nearest shrunken centroids of Prediction Analysis of Microarray (PAM) into classical discriminant analysis. These SCRDA methods are specially designed for classification problems in high-dimension, low-sample-size situations, for example microarray data. Through both a simulation study and real-life data, it is shown that these SCRDA methods perform uniformly well in multivariate classification problems and, in particular, outperform the currently popular PAM. Some of them are also suitable for feature elimination and can be used as gene selection methods. Open source R code for these methods is also available and will be added to the R libraries in the near future.
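
The nearest shrunken centroids idea that SCRDA builds on can be sketched in Python as follows. This is a simplified illustration of the commonly described PAM recipe (standardized centroid differences, soft thresholding, nearest-centroid classification), not the SCRDA method or the authors' R code. The function names, the choice of the median for the offset `s0`, and the equal class priors are assumptions.

```python
import numpy as np


def soft_threshold(d, delta):
    """Shrink each entry of d toward zero by delta, setting small entries to 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)


def shrunken_centroids(X, y, delta):
    """Return class labels, shrunken centroids, per-feature scale, and shrunken d."""
    classes = np.unique(y)
    n, _ = X.shape
    overall = X.mean(axis=0)
    centroids = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled within-class standard deviation for each feature.
    resid = X - centroids[np.searchsorted(classes, y)]
    s = np.sqrt((resid ** 2).sum(axis=0) / (n - len(classes)))
    s0 = np.median(s)                      # assumed offset guarding against tiny s
    scale = s + s0
    n_k = np.array([(y == k).sum() for k in classes])
    m = np.sqrt(1.0 / n_k - 1.0 / n)       # makes d behave like a t-statistic
    d = (centroids - overall) / (m[:, None] * scale)
    d_shrunk = soft_threshold(d, delta)    # features with all-zero d drop out
    shrunk = overall + m[:, None] * scale * d_shrunk
    return classes, shrunk, scale, d_shrunk


def predict(X_new, classes, shrunk, scale):
    """Assign each row to the nearest shrunken centroid (equal priors assumed)."""
    diff = (X_new[:, None, :] - shrunk[None, :, :]) / scale
    dist = (diff ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]
```

A feature whose shrunken difference is zero for every class contributes equally to all class distances, so it has no effect on the prediction; this is what makes the threshold `delta` act as a gene selection knob.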

Journal ArticleDOI
TL;DR: This article exposes a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks, and shows that dramatic computational savings are possible over naive implementations.
Abstract: Summary: Gene expression arrays typically have 50 to 100 samples and 1000 to 20 000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.
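
One way such savings can arise for ridge regression when there are far more variables than samples is sketched below in Python. It is an illustration under assumptions, not the article's code: the thin SVD reduces the n × p problem to an n × n one, and the function names `ridge_via_svd` and `ridge_direct` are hypothetical. No intercept is fitted.

```python
import numpy as np


def ridge_via_svd(X, y, lam):
    """Ridge coefficients for an n x p design with p >> n, using a reduced problem.

    Write X = R @ Vt with R = U * d (n x k, k = min(n, p)). Solving the
    ridge problem for R and mapping back with V gives the same answer as
    solving the full p x p system.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    R = U * d                                  # reduced design matrix, n x k
    k = R.shape[1]
    theta = np.linalg.solve(R.T @ R + lam * np.eye(k), R.T @ y)
    return Vt.T @ theta                        # back to the original p coordinates


def ridge_direct(X, y, lam):
    """Naive ridge solve that forms and factors a p x p matrix."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 500))
    y = rng.normal(size=20)
    gap = np.max(np.abs(ridge_via_svd(X, y, 1.0) - ridge_direct(X, y, 1.0)))
    print(gap)
```

The reduced solve costs on the order of n³ after the SVD, whereas the naive version costs on the order of p³, which is the difference that matters when p is in the thousands and n is below 100.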

Journal ArticleDOI
01 Nov 2004-Blood
TL;DR: Gene expression profiling identified AML patients with divergent prognoses within the FLT3-MU group, and the RUNX3 to ATRX expression ratio should be a useful prognostic indicator in these patients.

Journal ArticleDOI
TL;DR: Time-dependent changes in fetal tissue gene expression in a rat model of in utero hypoxia compared with normoxic controls were investigated as an initial approach to understand molecular events underlying fetal development in response to hypoxia.
Abstract: Molecular mechanisms underlying fetal growth restriction due to placental insufficiency and in utero hypoxia are not well understood. In the current study, time-dependent (3 h-11 days) changes in fetal tissue gene expression in a rat model of in utero hypoxia compared with normoxic controls were investigated as an initial approach to understand molecular events underlying fetal development in response to hypoxia. Under hypoxic conditions, litter size was reduced and IGFBP-1 was up-regulated in maternal serum and in fetal liver and heart. Tissue-specific, distinct regulatory patterns of gene expression were observed under acute vs. chronic hypoxic conditions. Induction of glycolytic enzymes was an early event in response to hypoxia during organ development; consistently, tissue-specific induction of calcium homeostasis-related genes and suppression of growth-related genes were observed, suggesting mechanisms underlying hypoxia-related fetal growth restriction. Furthermore, induction of inflammation-related genes in placentas exposed to long-term hypoxia (11 days) suggests a mechanism for placental dysfunction and impaired pregnancy outcome accompanying in utero hypoxia.

Journal ArticleDOI
TL;DR: Plasma proteomic profiling with SELDI-TOF mass spectrometry provides moderate sensitivity and specificity in discriminating HNSCC, and is likely to overpredict cancer in control smokers.
Abstract: Purpose: Our study was undertaken to determine the utility of plasma proteomic profiling using surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry for the detection of head and neck squamous cell carcinomas (HNSCCs). Experimental Design: Pretreatment plasma samples from HNSCC patients or controls without known neoplastic disease were analyzed on the Protein Biology System IIc SELDI-TOF mass spectrometer (Ciphergen Biosystems, Fremont, CA). Proteomic spectra of mass:charge ratio (m/z) were generated by the application of plasma to immobilized metal-affinity-capture (IMAC) ProteinChip arrays activated with copper. A total of 37,356 data points were generated for each sample. A training set of spectra from 56 cancer patients and 52 controls were applied to the “Lasso” technique to identify protein profiles that can distinguish cancer from noncancer, and cross-validation was used to determine test errors in this training set. The discovery pattern was then used to classify a separate masked test set of 57 cancer and 52 controls. In total, we analyzed the proteomic spectra of 113 cancer patients and 104 controls. Results: The Lasso approach identified 65 significant data points for the discrimination of normal from cancer profiles. The discriminatory pattern correctly identified 39 of 57 HNSCC patients and 40 of 52 noncancer controls in the masked test set. These results yielded a sensitivity of 68% and specificity of 73%. Subgroup analyses in the test set of four different demographic factors (age, gender, and cigarette and alcohol use) that can potentially confound the interpretation of the results suggest that this model tended to overpredict cancer in control smokers. Conclusions: Plasma proteomic profiling with SELDI-TOF mass spectrometry provides moderate sensitivity and specificity in discriminating HNSCC. Further improvement and validation of this approach is needed to determine its usefulness in screening for this disease.

Journal ArticleDOI
TL;DR: Gene expression differences between the 2 strains suggest that aortas of C57Bl/6 mice have a higher genetic propensity to develop inflammation in response to appropriate atherogenic stimuli.
Abstract: Objective— Different strains of inbred mice exhibit different susceptibility to the development of atherosclerosis. The C3H/HeJ and C57Bl/6 mice have been used in several studies aimed at understanding the genetic basis of atherosclerosis. Under controlled environmental conditions, variations in susceptibility to atherosclerosis reflect differences in genetic makeup, and these differences must be reflected in gene expression patterns that are temporally related to the development of disease. In this study, we sought to identify the genetic pathways that are differentially activated in the aortas of these mice. Methods and Results— We performed genome-wide transcriptional profiling of aortas from C3H/HeJ and C57Bl/6 mice. Differences in gene expression were identified at baseline as well as during normal aging and longitudinal exposure to high-fat diet. The significance of these genes to the development of atherosclerosis was evaluated by observing their temporal pattern of expression in the well-studied apolipoprotein E model of atherosclerosis. Conclusion— Gene expression differences between the 2 strains suggest that aortas of C57Bl/6 mice have a higher genetic propensity to develop inflammation in response to appropriate atherogenic stimuli. This study expands the repertoire of factors in known disease-related signaling pathways and identifies novel candidate genes for future study.

Proceedings Article
01 Dec 2004
TL;DR: In this article, the authors argue that the choice of the SVM cost parameter can be critical and derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model.
Abstract: In this paper we argue that the choice of the SVM cost parameter can be critical. We then derive an algorithm that can fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model.

Journal ArticleDOI
TL;DR: Discriminative margin clustering is a new technique for analyzing high dimensional quantitative datasets, especially applicable to gene expression data from microarray experiments related to cancer, which yields highly specialized tumor subtypes that are similar in terms of potential diagnostic markers.
Abstract: A central challenge in the molecular diagnosis and treatment of cancer is to define a set of molecular features that, taken together, distinguish a given cancer, or type of cancer, from all normal cells and tissues. Discriminative margin clustering is a new technique for analyzing high dimensional quantitative datasets, especially applicable to gene expression data from microarray experiments related to cancer. The goal of the analysis is to find highly specialized sub-types of a tumor type which are similar in having a small combination of genes which together provide a unique molecular portrait for distinguishing the sub-type from any normal cell or tissue. Detection of the products of these genes can then, in principle, provide a basis for detection and diagnosis of a cancer, and a therapy directed specifically at the distinguishing constellation of molecular features can, in principle, provide a way to eliminate the cancer cells, while minimizing toxicity to any normal cell. The new methodology yields highly specialized tumor subtypes which are similar in terms of potential diagnostic markers.

Proceedings ArticleDOI
16 Aug 2004
TL;DR: The motivation for boosted PRIM is to solve the problem of "searching for oncogenic pathways" based on array-CGH data, though the algorithm itself is suitable for general classification problems.
Abstract: Boosted PRIM (patient rule induction method) is a new algorithm developed for two-class classification problems. PRIM is a variation of tree-based methods that seeks box-shaped regions in the feature space to separate different classes. Boosted PRIM implements PRIM-style weak learners in AdaBoost, one of the most popular boosting algorithms. In addition, we improve the performance of the algorithm by introducing a regularization to the boosting process, which supports Jerry Friedman's perspective of viewing boosting as a steepest-descent numerical optimization. The motivation for boosted PRIM is to solve the problem of "searching for oncogenic pathways" based on array-CGH (comparative genomic hybridization) data, though the algorithm itself is suitable for general classification problems. We illustrate the performance of the method through some simulation studies as well as an application on a lung cancer array-CGH data set.

Journal ArticleDOI
TL;DR: A discussion of three papers on boosting and classification consistency: "Process consistency for AdaBoost" by W. Jiang, "On the Bayes-risk consistency of regularized boosting methods" by G. Lugosi and N. Vayatis, and "Statistical behavior and consistency of classification methods based on convex risk minimization" by T. Zhang.
Abstract: Discussions of: "Process consistency for AdaBoost" [Ann. Statist. 32 (2004), no. 1, 13-29] by W. Jiang; "On the Bayes-risk consistency of regularized boosting methods" [ibid., 30-55] by G. Lugosi and N. Vayatis; and "Statistical behavior and consistency of classification methods based on convex risk minimization" [ibid., 56-85] by T. Zhang. Includes rejoinders by the authors.

Journal Article
TL;DR: A short paper on the organization of queues for coronary surgery brings to mind H.L. Mencken's tag that every complex problem has a neat, simple solution — and it is wrong.
Abstract: Gerry B. Hill's short paper on the organization of queues for coronary surgery (page 354) [1] brings to mind H.L. Mencken's tag that every complex problem has a neat, simple solution, and it is wrong. For busy readers, in the recent tradition of 4-word movie reviews, [2] we offer a 6-word

Journal ArticleDOI
16 Nov 2004-Blood
TL;DR: It is hypothesized that gene expression profiles would identify genes that cooperate with FLT3 mutations in conferring poor clinical outcome, and it is observed that patients with normal karyotypes who were enrolled in the Pediatric Oncology Group (POG) study #9421 had two significantly different clinical outcomes that were associated with the expression of FLT3 mutations.