
Showing papers by "Robert Tibshirani" published in 2006


Journal Article
TL;DR: This work introduces a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings and shows that PCA can be formulated as a regression-type optimization problem.
Abstract: Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables, thus it is often difficult to interpret the results. We introduce a new method called sparse principal component analysis (SPCA) using the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.

3,102 citations
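
The idea of obtaining sparse loadings through lasso-style shrinkage can be sketched for a single component. This is a simplified illustration only, not the algorithm from the paper: it alternates between soft-thresholding (the shrinkage a lasso penalty induces in simple settings) and renormalizing the direction. The function names and the penalty value `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries below lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_first_component(X, lam, n_iter=100):
    """Return unit-norm sparse loadings for one component (illustrative sketch).

    X   : (n_samples, n_features) data matrix
    lam : assumed soft-threshold level; larger values give sparser loadings
    """
    Xc = X - X.mean(axis=0)            # center each variable
    G = Xc.T @ Xc                      # unnormalized covariance (Gram) matrix
    # Start from the ordinary first principal component loading.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    a = Vt[0]
    b = a.copy()
    for _ in range(n_iter):
        b = soft_threshold(G @ a, lam) # sparse update of the loadings
        if not b.any():                # everything was shrunk to zero
            break
        g = G @ b
        a = g / np.linalg.norm(g)      # renormalized direction
    norm = np.linalg.norm(b)
    return b / norm if norm > 0 else b
```

With a suitable `lam`, variables that contribute little to the leading direction receive loadings of exactly zero, which is what makes the component easier to interpret.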


Journal Article
TL;DR: Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome; it can be applied to regression and generalized regression problems, such as survival analysis.
Abstract: In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.

773 citations
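
The procedure described in the abstract above (select predictors by their association with the outcome, then run principal components on that subset) can be sketched as follows. This is a minimal sketch under stated assumptions: absolute Pearson correlation is used as the association score and `threshold` is a hypothetical tuning parameter, neither of which is specified in the abstract.

```python
import numpy as np

def supervised_pc(X, y, threshold):
    """Illustrative supervised principal components for a quantitative outcome.

    X : (n_samples, n_features) predictors (no constant columns assumed)
    y : (n_samples,) outcome
    Returns (selected feature indices, first PC scores, regression slope).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Step 1: univariate association of each predictor with the outcome
    # (absolute Pearson correlation; an assumed choice of score).
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    score = np.abs(Xc.T @ yc) / denom
    # Step 2: keep only predictors whose score exceeds the threshold.
    selected = np.flatnonzero(score > threshold)
    if selected.size == 0:
        raise ValueError("no predictors pass the threshold")
    # Step 3: first principal component of the reduced matrix.
    U, s, _ = np.linalg.svd(Xc[:, selected], full_matrices=False)
    pc1 = U[:, 0] * s[0]
    # Step 4: regress the centered outcome on that component.
    slope = (pc1 @ yc) / (pc1 @ pc1)
    return selected, pc1, slope
```

In practice the threshold would be chosen by cross-validation; the sign of the component is arbitrary, so only the magnitude of its correlation with the outcome is meaningful.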


Journal Article
TL;DR: In this paper, the problem of identifying differentially expressed groups of genes from a microarray experiment is discussed and two potential improvements to GSEA are proposed: the max-mean statistic for summarizing gene-sets, and restandardization for more accurate inferences.
Abstract: This paper discusses the problem of identifying differentially expressed groups of genes from a microarray experiment. The groups of genes are externally defined, for example, sets of gene pathways derived from biological databases. Our starting point is the interesting Gene Set Enrichment Analysis (GSEA) procedure of Subramanian et al. [Proc. Natl. Acad. Sci. USA 102 (2005) 15545–15550]. We study the problem in some generality and propose two potential improvements to GSEA: the maxmean statistic for summarizing gene-sets, and restandardization for more accurate inferences. We discuss a variety of examples and extensions, including the use of gene-set scores for class predictions. We also describe a new R language package GSA that implements our ideas.

722 citations
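
The maxmean statistic named in the abstract above can be illustrated with a short sketch. The abstract does not give the formula; the version below follows the commonly described definition (average the positive parts and the negative parts of the gene scores separately over the whole set, then keep whichever is larger in magnitude, with its sign), so treat it as an assumption rather than the GSA package's implementation.

```python
import numpy as np

def maxmean(z):
    """Maxmean summary of per-gene scores z for one gene set (illustrative).

    The positive parts and negative parts are each averaged over ALL genes
    in the set; the larger average wins and carries its sign.
    """
    z = np.asarray(z, dtype=float)
    pos = np.maximum(z, 0.0).mean()    # average positive part
    neg = np.maximum(-z, 0.0).mean()   # average negative part
    return pos if pos >= neg else -neg
```

For scores `[2, -1, 0, 3]` the positive part averages 1.25 and the negative part 0.25, so the statistic is 1.25; for `[-4, 1, -2, 1]` it is -1.5.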


Journal Article
TL;DR: A genome‐wide array‐based comparative genomic hybridization (array CGH) survey of CNAs in 89 breast tumors from a patient cohort with locally advanced disease links distinct cytoband loci harboring CNAs to specific clinicopathological parameters, including tumor grade, estrogen receptor status, presence of TP53 mutation, and overall survival.
Abstract: Breast cancer is a leading cause of cancer-death among women, where the clinicopathological features of tumors are used to prognosticate and guide therapy. DNA copy number alterations (CNAs), which occur frequently in breast cancer and define key pathogenetic events, are also potentially useful prognostic or predictive factors. Here, we report a genome-wide array-based comparative genomic hybridization (array CGH) survey of CNAs in 89 breast tumors from a patient cohort with locally advanced disease. Statistical analysis links distinct cytoband loci harboring CNAs to specific clinicopathological parameters, including tumor grade, estrogen receptor status, presence of TP53 mutation, and overall survival. Notably, distinct spectra of CNAs also underlie the different subtypes of breast cancer recently defined by expression-profiling, implying these subtypes develop along distinct genetic pathways. In addition, higher numbers of gains/losses are associated with the "basal-like" tumor subtype, while high-level DNA amplification is more frequent in "luminal-B" subtype tumors, suggesting also that distinct mechanisms of genomic instability might underlie their pathogenesis. The identified CNAs may provide a basis for improved patient prognostication, as well as a starting point to define important genes to further our understanding of the pathobiology of breast cancer. This article contains Supplementary Material available at http://www.interscience.wiley.com/jpages/1045-2257/suppmat

509 citations


Journal Article
TL;DR: In this paper, the Eppendorf polarographic electrode was used to measure tumor oxygenation in resectable non-small cell lung cancers (NSCLC) and to correlate tumor pO2 and the selected gene and protein expression to treatment outcomes.
Abstract: Background: To directly assess tumor oxygenation in resectable non-small cell lung cancers (NSCLC) and to correlate tumor pO2 and the selected gene and protein expression to treatment outcomes. Methods: Twenty patients with resectable NSCLC were enrolled. Intraoperative measurements of normal lung and tumor pO2 were done with the Eppendorf polarographic electrode. All patients had plasma osteopontin measurements by ELISA. Carbonic anhydrase-IX (CA IX) staining of tumor sections was done in the majority of patients (n = 16), as was gene expression profiling (n = 12) using cDNA microarrays. Tumor pO2 was correlated with CA IX staining, osteopontin levels, and treatment outcomes. Results: The median tumor pO2 ranged from 0.7 to 46 mm Hg (median, 16.6) and was lower than normal lung pO2 in all but one patient. Because both variables were affected by the completeness of lung deflation during measurement, we used the ratio of tumor/normal lung (T/L) pO2 as a reflection of tumor oxygenation. The median T/L pO2 was 0.13. T/L pO2 correlated significantly with plasma osteopontin levels (r = 0.53, P = 0.02) and CA IX expression (P = 0.006). Gene expression profiling showed that high CD44 expression was a predictor for relapse, which was confirmed by tissue staining of CD44 variant 6 protein. Other variables associated with the risk of relapse were T stage (P = 0.02), T/L pO2 (P = 0.04), and osteopontin levels (P = 0.001). Conclusions: Tumor hypoxia exists in resectable NSCLC and is associated with elevated expression of osteopontin and CA IX. Tumor hypoxia and elevated osteopontin levels and CD44 expression correlated with poor prognosis. A larger study is needed to confirm the prognostic significance of these factors. © 2006 American Association for Cancer Research.

245 citations


Journal Article
05 Jan 2006-Oncogene
TL;DR: The findings support a role of altered apoptotic balance in the pathogenesis of SCLC, and suggest that MYC family genes might affect oncogenesis through distinct sets of targets, in particular implicating the importance of transcriptional repression.
Abstract: DNA amplifications and deletions frequently contribute to the development and progression of lung cancer. To identify such novel alterations in small cell lung cancer (SCLC), we performed comparative genomic hybridization on a set of 24 SCLC cell lines, using cDNA microarrays representing approximately 22,000 human genes (providing an average mapping resolution of <70 kb). We identified localized DNA amplifications corresponding to oncogenes known to be amplified in SCLC, including MYC (8q24), MYCN (2p24) and MYCL1 (1p34). Additional highly localized DNA amplifications suggested candidate oncogenes not previously identified as amplified in SCLC, including the antiapoptotic genes TNFRSF4 (1p36), DAD1 (14q11), BCL2L1 (20q11) and BCL2L2 (14q11). Likewise, newly discovered PCR-validated homozygous deletions suggested candidate tumor-suppressor genes, including the proapoptotic genes MAPK10 (4q21) and TNFRSF6 (10q23). To characterize the effect of DNA amplification on gene expression patterns, we performed expression profiling using the same microarray platform. Among our findings, we identified sets of genes whose expression correlated with MYC, MYCN or MYCL1 amplification, with surprisingly little overlap among gene sets. While both MYC and MYCN amplification were associated with increased and decreased expression of known MYC upregulated and downregulated targets, respectively, MYCL1 amplification was associated only with the latter. Our findings support a role of altered apoptotic balance in the pathogenesis of SCLC, and suggest that MYC family genes might affect oncogenesis through distinct sets of targets, in particular implicating the importance of transcriptional repression.

188 citations


Journal Article
TL;DR: Evidence is presented in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes).
Abstract: Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.

141 citations


Journal Article
TL;DR: This short article discusses a simple method for assessing sample size requirements in microarray experiments by estimating the false discovery rate and false negative rate of a list of genes for a given hypothesized mean difference and various sample sizes.
Abstract: In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes. Our method seems to be useful for sample size assessment in microarray experiments.

108 citations


Journal Article
TL;DR: Both treatment outcome and race are associated with different transcriptional responses to interferon treatment, and key factors affecting the outcome of IFN‐α therapy are likely to act at the JAK‐STAT pathway that controls transcription of downstream ISGs.

92 citations


Journal Article
TL;DR: H. pylori eradication may stop or reverse ongoing molecular processes in the stomach, and genes involved in cell-cell adhesion and lining, cell cycle differentiation, and lipid metabolism and transport were down-regulated over time in the treatment group but up-regulated in the placebo group.
Abstract: Helicobacter pylori causes gastric preneoplasia and neoplasia. Eradicating H. pylori can result in partial regression of preneoplastic lesions; however, the molecular underpinning of this change is unknown. To identify molecular changes in the gastric mucosa following H. pylori eradication, we used cDNA microarrays (with each array containing approximately 30,300 genes) to analyze 54 gastric biopsies from a randomized, placebo-controlled trial of H. pylori therapy. The 54 biopsies were obtained from 27 subjects (13 from the treatment and 14 from the placebo group) with chronic gastritis, atrophy, and/or intestinal metaplasia. Each subject contributed one biopsy before and another biopsy 1 year after the intervention. Significance analysis of microarrays (SAM) was used to compare the gene expression profiles of pre-intervention and post-intervention biopsies. In the treatment group, SAM identified 30 genes whose expression changed significantly from baseline to 1 year after treatment (0 up-regulated and 30 down-regulated). In the placebo group, the expression of 55 genes differed significantly over the 1-year period (32 up-regulated and 23 down-regulated). Five genes involved in cell-cell adhesion and lining (TACSTD1 and MUC13), cell cycle differentiation (S100A10), and lipid metabolism and transport (FABP1 and MTP) were down-regulated over time in the treatment group but up-regulated in the placebo group. Immunohistochemistry for one of these differentially expressed genes (FABP1) confirmed the changes in gene expression observed by microarray. In conclusion, H. pylori eradication may stop or reverse ongoing molecular processes in the stomach. Further studies are needed to evaluate the use of these genes as markers for gastric cancer risk.

25 citations


Posted Content
TL;DR: This work proposes a method for detecting differential gene expression that exploits the correlation between genes and averages the univariate scores of each feature with the scores in correlation neighborhoods to achieve correlation-sharing.
Abstract: We propose a method for detecting differential gene expression that exploits the correlation between genes. Our proposal averages the univariate scores of each feature with the scores in correlation neighborhoods. In a number of real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also provide some analysis of the asymptotic behavior of our proposal. The general idea of correlation-sharing can be applied to other prediction problems involving a large number of correlated features. We give an example in protein mass spectrometry.
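
The averaging step described in the abstract above can be sketched as follows. This is a simplified illustration under assumptions not stated in the abstract: the neighborhood of a feature is taken to be all features whose correlation with it is at least a fixed, hypothetical cutoff `rho`, and the absolute scores are averaged with the feature's own sign restored afterwards.

```python
import numpy as np

def correlation_shared_scores(X, t, rho):
    """Average each feature's score with those of its correlated neighbors.

    X   : (n_samples, n_features) data matrix (no constant columns assumed)
    t   : (n_features,) univariate scores, e.g. t-statistics
    rho : assumed fixed correlation cutoff defining a neighborhood
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    corr = np.corrcoef(X, rowvar=False)    # feature-by-feature correlations
    neighbors = corr >= rho                # boolean neighborhood matrix
    np.fill_diagonal(neighbors, True)      # a feature is its own neighbor
    # Mean absolute score within each neighborhood.
    shared = (neighbors @ np.abs(t)) / neighbors.sum(axis=1)
    return np.sign(t) * shared             # restore each feature's sign
```

A feature with no correlated neighbors keeps its original score, while a feature inside a correlated block borrows strength from the block.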

Journal Article
16 Nov 2006-Blood
TL;DR: Intratumoral injection of PF-3512676 (CpG 7909) at a fixed dose of 6 mg, combined with low-dose radiation (2 Gy × 2), is a safe and well-tolerated regimen in patients with recurrent low-grade lymphomas, and an anti-tumor effect has been observed.

Journal Article
16 Nov 2006-Blood
TL;DR: There is no evidence that the percentage of tumor-infiltrating T cells or their subsets is predictive of clinical outcome in follicular lymphoma.

01 Jan 2006
TL;DR: This work introduces shrunken centroids regularized discriminant analysis (SCRDA), which generalizes the idea of the “nearest shrunken centroids” (NSC) to classical discriminant analysis, and shows that the method performs very well in multivariate classification problems and often outperforms the PAM method.
Abstract: In this paper, we introduce a modified version of linear discriminant analysis, called the “shrunken centroids regularized discriminant analysis” (SCRDA). This method generalizes the idea of the “nearest shrunken centroids” (NSC) (Tibshirani et al., 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as

Journal Article
16 Nov 2006-Blood
TL;DR: Risk assignment for patients with NK may be feasible by analyzing a limited number of genes, and two genes previously not identified as related to outcome in AML are identified in the analysis in two highly interacting networks related to the MYC gene.