Showing papers by "Robert Tibshirani" published in 2010


Journal ArticleDOI
TL;DR: In comparative timings, the new algorithms are considerably faster than competing methods; they can handle large problems and deal efficiently with sparse features.
Abstract: We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

13,656 citations
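
Illustrative sketch (not taken from the paper): the abstract above describes cyclical coordinate descent computed along a regularization path. The Python below is a minimal, standard-library-only sketch of that idea for the simplest case only, a Gaussian lasso with no intercept, using the objective (1/2n)·||y − Xb||² + λ·||b||₁. The function names `soft_threshold`, `lasso_cd`, and `lasso_path` are hypothetical and chosen for illustration; the elastic net and logistic/multinomial models mentioned in the abstract are not covered.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_cd(X, y, lam, beta=None, n_sweeps=200, tol=1e-10):
    """Cyclical coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    X is a list of n rows, each a list of p floats; no intercept is fitted.
    `beta` is an optional warm start (assumed interface, for illustration).
    """
    n, p = len(X), len(X[0])
    beta = list(beta) if beta is not None else [0.0] * p
    # Keep the residual r = y - X beta up to date instead of recomputing it.
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # (1/n) * x_j^T (partial residual with coordinate j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def lasso_path(X, y, lambdas):
    """Solve on a decreasing grid of lambdas, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = lasso_cd(X, y, lam, beta=beta)
        path.append((lam, beta))
    return path
```

Starting each fit from the previous solution (a warm start) is what makes computing the whole path cheap in this sketch: neighbouring values of λ tend to have similar solutions, so few sweeps are needed per grid point.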


Journal Article
TL;DR: Using the nuclear norm as a regularizer, the algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD in a sequence of regularized low-rank solutions for large-scale matrix completion problems.
Abstract: We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm SOFT-IMPUTE iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity of order linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices; for example SOFT-IMPUTE takes a few hours to compute low-rank approximations of a 10^6 × 10^6 incomplete matrix with 10^7 observed entries, and fits a rank-95 approximation to the full Netflix training set in 3.3 hours. Our methods achieve good training and test errors and exhibit superior timings when compared to other competitive state-of-the-art techniques.

1,195 citations


Posted Content
TL;DR: An efficient algorithm is derived for the resulting convex problem based on coordinate descent that can be used to solve the general form of the group lasso, with non-orthonormal model matrices.
Abstract: We consider the group lasso penalty for the linear model. We note that the standard algorithm for solving the problem assumes that the model matrices in each group are orthonormal. Here we consider a more general penalty that blends the lasso (L1) with the group lasso ("two-norm"). This penalty yields solutions that are sparse at both the group and individual feature levels. We derive an efficient algorithm for the resulting convex problem based on coordinate descent. This algorithm can also be used to solve the general form of the group lasso, with non-orthonormal model matrices.

800 citations


Journal ArticleDOI
TL;DR: A novel framework for sparse clustering is proposed, in which one clusters the observations using an adaptively chosen subset of the features, which uses a lasso-type penalty to select the features.
Abstract: We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.

643 citations


Journal ArticleDOI
TL;DR: This work validated csSAM with predesigned mixtures and applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
Abstract: We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.

499 citations


Journal ArticleDOI
TL;DR: In situ tumor vaccination with a TLR9 agonist induces systemic antilymphoma clinical responses and is clinically feasible and does not require the production of a customized vaccine product.
Abstract: Purpose Combining tumor antigens with an immunostimulant can induce the immune system to specifically eliminate cancer cells. Generally, this combination is accomplished in an ex vivo, customized manner. In a preclinical lymphoma model, intratumoral injection of a Toll-like receptor 9 (TLR9) agonist induced systemic antitumor immunity and cured large, disseminated tumors. Patients and Methods We treated 15 patients with low-grade B-cell lymphoma using low-dose radiotherapy to a single tumor site and—at that same site—injected the C-G enriched, synthetic oligodeoxynucleotide (also referred to as CpG) TLR9 agonist PF-3512676. Clinical responses were assessed at distant, untreated tumor sites. Immune responses were evaluated by measuring T-cell activation after in vitro restimulation with autologous tumor cells. Results This in situ vaccination maneuver was well-tolerated with only grade 1 to 2 local or systemic reactions and no treatment-limiting adverse events. One patient had a complete clinical response,...

443 citations


Journal ArticleDOI
19 Nov 2010-Blood
TL;DR: It is concluded that the measurement of a single gene expressed by tumor cells (LMO2) and a single gene expressed by the immune microenvironment (TNFRSF9) powerfully predicts overall survival in patients with DLBCL.

183 citations


Journal ArticleDOI
TL;DR: A novel application of a log linear model has been described that resulted in the identification of 67 miRNAs that were differentially-expressed between the tumour and normal samples at a false discovery rate less than 0.001.
Abstract: Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA libraries from 29 tumour/normal pairs of human cervical tissue samples. Analysis of the resulting sequences (42 million in total) defined 64 new human microRNA (miRNA) genes. Both arms of the hairpin precursor were observed in twenty-three of the newly identified miRNA candidates. We tested several computational approaches for the analysis of class differences between high throughput sequencing datasets and describe a novel application of a log linear model that has provided the most effective analysis for this data. This method resulted in the identification of 67 miRNAs that were differentially-expressed between the tumour and normal samples at a false discovery rate less than 0.001. This approach can potentially be applied to any kind of RNA sequencing data for analysing differential sequence representation between biological sample sets.

174 citations


Journal ArticleDOI
TL;DR: A number of methods from the literature are reviewed that address the problems of identifying features that are associated with survival and developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation.
Abstract: In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g., patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation. Due to the high dimensionality of this data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.

158 citations


01 Jan 2010
TL;DR: It is found that for edge selection, a simple method based on univariate screening of the elements of the empirical correlation matrix usually performs as well or better than all of the more complex methods proposed here and elsewhere.
Abstract: We propose several methods for estimating edge-sparse and node-sparse graphical models based on lasso and grouped lasso penalties. We develop efficient algorithms for fitting these models when the numbers of nodes and potential edges are large. We compare them to competing methods including the graphical lasso and SPACE (Peng, Wang, Zhou & Zhu 2008). Surprisingly, we find that for edge selection, a simple method based on univariate screening of the elements of the empirical correlation matrix usually performs as well or better than all of the more complex methods proposed here and elsewhere. Running title: Applications of the lasso and grouped lasso

135 citations
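
Illustrative sketch (not taken from the paper): the abstract above reports that univariate screening of the empirical correlation matrix is a strong baseline for edge selection. One plausible reading, assumed here, is to keep an edge between two variables whenever the absolute value of their sample correlation reaches a chosen threshold. The names `correlation` and `screen_edges` and the `threshold` parameter are hypothetical; the paper's exact screening procedure may differ.

```python
import math


def correlation(a, b):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def screen_edges(columns, threshold):
    """Return (i, j, r) for every pair of variables with |r| >= threshold.

    `columns` is a list of variables, each a list of observations.
    Thresholding |r| is an assumed form of "univariate screening".
    """
    p = len(columns)
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            r = correlation(columns[i], columns[j])
            if abs(r) >= threshold:
                edges.append((i, j, r))
    return edges
```

For example, `screen_edges([a, b, c], 0.5)` examines each of the three pairs once and returns only those whose correlation magnitude is at least 0.5.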


Journal ArticleDOI
11 Feb 2010-Oncogene
TL;DR: In this analysis that combined gene expression profiling, aCGH and IHC, distinct molecular LMS subtypes are characterized, provided insight into their pathogenesis, and identified prognostic biomarkers.
Abstract: Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.

Journal ArticleDOI
TL;DR: In this paper, a transposable regularized covariance model is proposed to estimate the mean and non-singular covariance matrices of high-dimensional data in the form of a matrix, where rows and columns each have a separate mean vector and covariance matrix.
Abstract: Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

Journal ArticleDOI
19 Jan 2010-PLOS ONE
TL;DR: It is demonstrated that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.
Abstract: Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address this limitation of gene expression microarrays, we designed a novel procedure (3'-end sequencing for expression quantification (3SEQ)) for gene expression profiling from FFPET using next-generation sequencing. We performed gene expression profiling by 3SEQ and microarray on both frozen tissue and FFPET from two soft tissue tumors (desmoid type fibromatosis (DTF) and solitary fibrous tumor (SFT)) (total n = 23 samples, which were each profiled by at least one of the four platform-tissue preparation combinations). Analysis of 3SEQ data revealed many genes differentially expressed between the tumor types (FDR<0.01) on both the frozen tissue (approximately 9.6K genes) and FFPET (approximately 8.1K genes). Analysis of microarray data from frozen tissue revealed fewer differentially expressed genes (approximately 4.64K), and analysis of microarray data on FFPET revealed very few (69) differentially expressed genes. Functional gene set analysis of 3SEQ data from both frozen tissue and FFPET identified biological pathways known to be important in DTF and SFT pathogenesis and suggested several additional candidate oncogenic pathways in these tumors. These findings demonstrate that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.

Journal ArticleDOI
TL;DR: It is postulated that VEGFR1 may oppose autocrine VEGFR2 signalling in DLBCL by competing for VEGF binding, and this finding is concordant with the prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone.
Abstract: Diffuse large B cell lymphoma (DLBCL) is clinically and biologically heterogeneous. In most cases of DLBCL, lymphoma cells co-express vascular endothelial growth factor (VEGF) and its receptors VEGFR1 and VEGFR2, suggesting autocrine in addition to angiogenic effects. We enumerated microvessel density and scored lymphoma cell expression of VEGF, VEGFR1, VEGFR2 and phosphorylated VEGFR2 in 162 de novo DLBCL patients treated with R-CHOP (rituximab, cyclophosphamide, vincristine, doxorubicin and prednisone)-like regimens. VEGFR2 expression correlated with shorter overall survival (OS) independent of International Prognostic Index (IPI) (P = 0.0028). Phosphorylated VEGFR2 (detected in 13% of cases) correlated with shorter progression-free survival (PFS, P = 0.044) and trended toward shorter OS on univariate analysis. VEGFR1 was not predictive of survival on univariate analysis, but it did correlate with better OS on multivariate analysis with VEGF, VEGFR2 and IPI (P = 0.036); in patients with weak VEGFR2, lack of VEGFR1 coexpression was significantly correlated with poor OS independent of IPI (P = 0.01). These results are concordant with our prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone. We postulate that VEGFR1 may oppose autocrine VEGFR2 signalling in DLBCL by competing for VEGF binding. In contrast to our prior results with chemotherapy alone, microvessel density was not prognostic of PFS or OS with R-CHOP-like therapy.

Journal ArticleDOI
TL;DR: High-dimensional flow cytometry analysis of normal hematopoietic tissue confirmed that among B- and T-cell subsets, germinal center B cells showed the highest level of CD81 expression and its role in the risk stratification of patients with diffuse large B-cell lymphoma.

Journal ArticleDOI
TL;DR: DNA/RNA-Integrator is introduced, a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes.
Abstract: Summary: DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes. Availability: DR-Integrator is freely available for non-commercial use from the Pollack Lab at http://pollacklab.stanford.edu/ and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name ‘DRI’ at http://cran.r-project.org/. An example analysis using DR-Integrator is included as supplemental material. Contact: ksalari@stanford.edu; pollack1@stanford.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Posted Content
TL;DR: In this paper, the authors propose strong rules for discarding predictors in lasso regression and related problems, for computational efficiency, complemented with simple checks of the Karush-Kuhn-Tucker (KKT) conditions.
Abstract: We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui et al (2010) propose "SAFE" rules that guarantee that a coefficient will be zero in the solution, based on the inner products of each predictor with the outcome. In this paper we propose strong rules that are not foolproof but rarely fail in practice. These can be complemented with simple checks of the Karush-Kuhn-Tucker (KKT) conditions to provide safe rules that offer substantial speed and space savings in a variety of statistical convex optimization problems.
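
Illustrative sketch (not taken from the abstract): the abstract does not state the rules themselves. The Python below assumes the basic strong rule in the form usually quoted for the lasso objective (1/2)·||y − Xb||² + λ·||b||₁, namely discard predictor j when |x_jᵀy| < 2λ − λ_max, where λ_max = max_j |x_jᵀy|, followed by a KKT check |x_jᵀ(y − Xb)| ≤ λ on the discarded predictors. The function names and the list-of-rows data layout are hypothetical.

```python
def inner(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


def column(X, j):
    """Column j of a matrix stored as a list of rows."""
    return [row[j] for row in X]


def basic_strong_rule(X, y, lam):
    """Indices of predictors kept by the (assumed) basic strong rule.

    Predictor j is discarded when |x_j^T y| < 2*lam - lam_max.
    """
    p = len(X[0])
    scores = [abs(inner(column(X, j), y)) for j in range(p)]
    lam_max = max(scores)
    return [j for j in range(p) if scores[j] >= 2 * lam - lam_max]


def kkt_violations(X, y, beta, lam, discarded, tol=1e-9):
    """Discarded predictors whose zero coefficient fails |x_j^T r| <= lam.

    `beta` is a full-length coefficient vector with zeros at discarded
    positions; r = y - X beta is the residual of the reduced fit.
    """
    resid = [yi - inner(row, beta) for row, yi in zip(X, y)]
    return [j for j in discarded if abs(inner(column(X, j), resid)) > lam + tol]
```

In use, one would fit the lasso on the kept predictors only, run `kkt_violations` on the discarded ones, and refit with any violators added back; this check is what turns a rule that "rarely fails" into a safe procedure.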

Journal ArticleDOI
TL;DR: Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma, and suggested that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.
Abstract: Chemokine receptor 1 (CCR1) is a G protein–coupled receptor that binds to members of the C-C chemokine family. Recently, CCL3 (MIP-1α), a high-affinity CCR1 ligand, was identified as part of a model that independently predicts survival in patients with diffuse large B-cell lymphoma (DLBCL). However, the role of chemokine signaling in the pathogenesis of human lymphomas is unclear. In normal human hematopoietic tissues, we found CCR1 expression in intraepithelial B cells of human tonsil and granulocytic/monocytic cells in the bone marrow. Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma. CCR1 expression correlated with the non–germinal center subtype of DLBCL but did not predict overall survival in follicular lymphoma. These data suggest that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.

Journal ArticleDOI
TL;DR: A novel prediction approach for patient survival time that makes use of time course structure of gene expression and is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.
Abstract: Characterizing dynamic gene expression pattern and predicting patient outcome is now significant and will be of more interest in the future with large scale clinical investigation of microarrays. However, there is currently no method that has been developed for prediction of patient outcome using longitudinal gene expression, where gene expression of patients is being monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of time course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.

Journal ArticleDOI
TL;DR: Genistein produces diverse effects on gene expression that are dose-dependent and this has important implications in developing genistein as a putative prostate cancer preventive agent.
Abstract: Epidemiological evidence suggests that soy consumption is associated with a decreased risk of prostate cancer. The isoflavone genistein is found at high levels in soy and a large body of evidence suggests it is important in mediating the cancer preventive effects of soy. The mechanisms through which genistein acts in prostate cancer cells have not been fully defined. We used gene expression profiling to identify genes significantly modulated by low and high doses of genistein in LNCaP cells. Significant genes were identified using StepMiner analysis and significantly altered pathways with Ingenuity Pathways analysis. Genistein significantly altered expression of transcripts involved in cell growth, carcinogen defenses and steroid signaling pathways. The effects of genistein on these pathways were confirmed by directly assessing dose-related effects on LNCaP cell growth, NQO-1 enzymatic activity and PSA protein expression. Genistein produces diverse effects on gene expression that are dose-dependent and this has important implications in developing genistein as a putative prostate cancer preventive agent.

Posted Content
TL;DR: In this article, the effect of both row and column correlations on commonly used test-statistics, null distributions, and multiple testing procedures, by explicitly modeling the covariances with the matrix-variate normal distribution, is investigated.
Abstract: We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Often this data is transposable, meaning that both the row variables and column variables are of potential interest. An example of this scenario is detecting significant genes in microarrays when the samples or arrays may be dependent due to underlying relationships. We study the effect of both row and column correlations on commonly used test-statistics, null distributions, and multiple testing procedures, by explicitly modeling the covariances with the matrix-variate normal distribution. Using this model, we give both theoretical and simulation results revealing the problems associated with using standard statistical methodology on transposable data. We solve these problems by estimating the row and column covariances simultaneously, with transposable regularized covariance models, and de-correlating or sphering the data as a pre-processing step. Under reasonable assumptions, our method gives test statistics that follow the scaled theoretical null distribution and are approximately independent. Simulations based on various models with structured and observed covariances from real microarray data reveal that our method offers substantial improvements in two areas: 1) increased statistical power and 2) correct estimation of false discovery rates.


Journal ArticleDOI
01 Jun 2010-Chance
TL;DR: The United States contains some of the world’s most dangerous roads, and the shortfall in U.S. road safety is a new issue, since American roads were considered the safest in the world 50 years ago.
Abstract: (2010). Road Crashes and the Next U.S. Presidential Election. CHANCE: Vol. 23, Election Issues, pp. 20-24.


Journal ArticleDOI
TL;DR: The development of truly personalized therapies will require ascertaining the key differences among individuals as well as similarities between cohorts within a disease type, which is a major challenge in clinical trials designs.
Abstract: We substantially agree with the comments of Catchpoole et al in response to our editorial. The idea that analyses of multidimensional, highly complex datasets of cancer and host factors will lead to more precise and successful individualized therapies is immensely appealing. We agree that building the algorithms for truly personalized medicine indeed will require paradigm shifts in clinical and translational research. The path to tailored therapies in individuals will be paved in large part by a deeper understanding of the complexity and heterogeneity of cancers. Genomics has contributed much to this understanding, as exemplified by the reclassification of breast cancers into many distinct subtypes based on gene expression profiles. Our endorsement of the so-called virtue of complexity is an appeal to strengthen the scientific rigor of genomic studies in cancer. Increasing the number of patients and samples is necessary but far from sufficient. Other important factors in conducting and reporting such studies include patient stratification, integration of various high-throughput technologies, novel trial designs and bioinformatics approaches, integration of emerging concepts in cancer biology, and transparent and complete presentation of statistical analyses. Despite the limitations of reductionism, genomic signatures derived from such approaches have proved useful in defining risks for relapse and appropriate patients for adjuvant therapies in early-stage breast cancers. Moreover, the development of predictive therapeutic biomarkers based on the understanding of molecular pathways, networks, and drug mechanisms has much to contribute to the personalization of cancer therapies, as illustrated by the importance of RAS mutation status in colorectal cancers treated with epidermal growth factor receptor–targeted antibodies.
Ultimately, however, we agree that the development of truly personalized therapies will require ascertaining the key differences among individuals as well as similarities between cohorts within a disease type. Proof that tailoring makes a difference, particularly in a highly curable disease like pediatric acute lymphoblastic leukemia, is itself a major challenge in clinical trials designs.

Journal ArticleDOI
19 Nov 2010-Blood
TL;DR: A novel in situ vaccination strategy using a combination of intratumoral CpG ODN and low-dose radiation is feasible in CTCL/MF with acceptable toxicities, and reduction of skin DCs may suggest cross-priming and migration of DCs to regional lymph nodes.

Posted Content
TL;DR: This work introduces a new methodology for identifying gene sets that are differentially expressed under varying experimental conditions that uses a hierarchical Bayesian framework where a hyperparameter measures the significance of each gene set.
Abstract: Author(s): Shahbaba, Babak; Tibshirani, Robert; Shachaf, Catherine M; Plevritis, Sylvia K | Abstract: Gene expression microarray technologies provide the simultaneous measurements of a large number of genes. Typical analyses of such data focus on the individual genes, but recent work has demonstrated that evaluating changes in expression across predefined sets of genes often increases statistical power and produces more robust results. We introduce a new methodology for identifying gene sets that are differentially expressed under varying experimental conditions. Our approach uses a hierarchical Bayesian framework where a hyperparameter measures the significance of each gene set. Using simulated data, we compare our proposed method to alternative approaches, such as Gene Set Enrichment Analysis (GSEA) and Gene Set Analysis (GSA). Our approach provides the best overall performance. We also discuss the application of our method to experimental data based on p53 mutation status.