Showing papers by "Robert Tibshirani" published in 2005


Journal Article
TL;DR: The fused lasso is proposed, a generalization that is designed for problems with features that can be ordered in some meaningful way, and is especially useful when the number of features p is much greater than N, the sample size.
Abstract: Summary. The lasso penalizes a least squares regression by the sum of the absolute values (L1-norm) of the coefficients. The form of this penalty encourages sparse solutions (with many coefficients equal to 0). We propose the ‘fused lasso’, a generalization that is designed for problems with features that can be ordered in some meaningful way. The fused lasso penalizes the L1-norm of both the coefficients and their successive differences. Thus it encourages sparsity of the coefficients and also sparsity of their differences, i.e. local constancy of the coefficient profile. The fused lasso is especially useful when the number of features p is much greater than N, the sample size. The technique is also extended to the ‘hinge’ loss function that underlies the support vector classifier. We illustrate the methods on examples from protein mass spectroscopy and gene expression data.

2,760 citations
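
As a rough illustration of the penalty described in the abstract above, the following Python sketch evaluates a penalised form of the fused lasso criterion: the residual sum of squares plus an L1 term on the coefficients and an L1 term on their successive differences. The function names and the tuning parameters `lam1` and `lam2` are hypothetical, chosen for this example. The penalised (Lagrangian) form is assumed here for simplicity, and nothing below is taken from the authors' software.

```python
def fused_lasso_penalty(beta, lam1, lam2):
    """Return lam1 * sum_j |beta_j| + lam2 * sum_j |beta_j - beta_{j-1}|.

    beta is a sequence of coefficients whose order is meaningful;
    lam1 and lam2 are illustrative (assumed) tuning parameters.
    """
    sparsity_term = sum(abs(b) for b in beta)
    # Successive differences between neighbouring coefficients.
    fusion_term = sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return lam1 * sparsity_term + lam2 * fusion_term


def fused_lasso_objective(X, y, beta, lam1, lam2):
    """Residual sum of squares plus the fused lasso penalty.

    X is a list of rows (one per sample), y a list of responses.
    """
    rss = 0.0
    for row, target in zip(X, y):
        fitted = sum(x * b for x, b in zip(row, beta))
        rss += (target - fitted) ** 2
    return rss + fused_lasso_penalty(beta, lam1, lam2)
```

With equal weights on both terms, a piecewise-constant profile such as `[0, 0, 2, 2, 2, 0]` incurs a smaller penalty than a jagged profile with the same sum of absolute values, such as `[0, 2, 0, 2, 0, 2]`. That preference is the local constancy the abstract refers to.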


Journal Article
TL;DR: It is shown that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature.
Abstract: Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this “wound-response signature” to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering (“molecular subtypes”) or supervised predictors of metastasis (“70-gene prognosis signature”).

978 citations


Journal Article
TL;DR: The key idea is to view clustering as a supervised classification problem, in which the “true” class labels are estimated, and the resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well.
Abstract: This article proposes a new quantity for assessing the number of groups or clusters in a dataset. The key idea is to view clustering as a supervised classification problem, in which we must also estimate the “true” class labels. The resulting “prediction strength” measure assesses how many groups can be predicted from the data, and how well. In the process, we develop novel notions of bias and variance for unlabeled data. Prediction strength performs well in simulation studies, and we apply it to clusters of breast cancer samples from a DNA microarray study. Finally, some consistency properties of the method are established.

594 citations


Journal Article
TL;DR: A new algorithm 'Cluster along chromosomes' (CLAC) is proposed for the analysis of array CGH data, which builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level.
Abstract: Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.

246 citations


Journal Article
TL;DR: A set of 259 genes was identified that predicts disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status.
Abstract: Background: Conventional renal cell carcinoma (cRCC) accounts for most of the deaths due to kidney cancer. Tumor stage, grade, and patient performance status are used currently to predict survival after surgery. Our goal was to identify gene expression features, using comprehensive gene expression profiling, that correlate with survival. Methods and Findings: Gene expression profiles were determined in 177 primary cRCCs using DNA microarrays. Unsupervised hierarchical clustering analysis segregated cRCC into five gene expression subgroups. Expression subgroup was correlated with survival in long-term follow-up and was independent of grade, stage, and performance status. The tumors were then divided evenly into training and test sets that were balanced for grade, stage, performance status, and length of follow-up. A semisupervised learning algorithm (supervised principal components analysis) was applied to identify transcripts whose expression was associated with survival in the training set, and the performance of this gene expression-based survival predictor was assessed using the test set. With this method, we identified 259 genes that accurately predicted disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status (p < 0.001). Conclusions: cRCC displays molecular heterogeneity and can be separated into gene expression subgroups that correlate with survival after surgery. We have identified a set of 259 genes that predict survival after surgery independent of clinical prognostic factors.

215 citations


Journal Article
TL;DR: The findings suggest candidate genes and pathways, which may contribute to the development or progression of pancreatic cancer, and novel localized amplicons, suggesting previously unrecognized candidate oncogenes.

202 citations


Journal Article
TL;DR: A hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering, built on the new idea of a mutual cluster: a group of points closer to each other than to any other points.
Abstract: In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.

159 citations
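
The mutual cluster idea in the abstract above can be made concrete with a small check. The sketch below reads "closer to each other than to any other points" as: the largest distance within the group is smaller than the smallest distance from any group member to any point outside it. The function name `is_mutual_cluster`, the use of Euclidean distance, and the handling of edge cases are assumptions made for this example, not the authors' algorithm for finding mutual clusters.

```python
import math
from itertools import combinations


def is_mutual_cluster(points, members):
    """Check whether the points indexed by `members` form a mutual cluster.

    points  : list of coordinate tuples
    members : iterable of indices into `points`

    Assumed convention: a group needs at least two members and at least
    one outside point, otherwise the function returns False.
    """
    inside = sorted(set(members))
    outside = [i for i in range(len(points)) if i not in set(inside)]
    if len(inside) < 2 or not outside:
        return False

    # Largest pairwise distance within the candidate group.
    diameter = max(
        math.dist(points[i], points[j]) for i, j in combinations(inside, 2)
    )
    # Smallest distance from any member to any non-member.
    gap = min(
        math.dist(points[i], points[k]) for i in inside for k in outside
    )
    return diameter < gap
```

For example, with `points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]`, the indices `{0, 1, 2}` pass the check, while `{0, 3}` do not, because point 0 is much closer to point 1 than to point 3.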


Journal Article
TL;DR: The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients and additional studies with a large sample size are warranted to confirm or refute this finding.
Abstract: Existing methods to detect breast cancer in asymptomatic patients have limitations, and there is a need to develop more accurate and convenient methods. In this study, we investigated whether early detection of breast cancer is possible by analyzing gene-expression patterns in peripheral blood cells. Using macroarrays and nearest-shrunken-centroid method, we analyzed the expression pattern of 1,368 genes in peripheral blood cells of 24 women with breast cancer and 32 women with no signs of this disease. The results were validated using a standard leave-one-out cross-validation approach. We identified a set of 37 genes that correctly predicted the diagnostic class in at least 82% of the samples. The majority of these genes had a decreased expression in samples from breast cancer patients, and predominantly encoded proteins implicated in ribosome production and translation control. In contrast, the expression of some defense-related genes was increased in samples from breast cancer patients. The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.

134 citations


Journal Article
TL;DR: The asymptotic distribution of the tail strength measure is derived, and its use on a number of real datasets is illustrated.
Abstract: SUMMARY We propose an overall measure of significance for a set of hypothesis tests. The ‘tail strength’ is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.

82 citations
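
The abstract above says only that the tail strength is "a simple function of the p-values" and does not give the formula. The sketch below assumes the definition as commonly stated for the published version of this work, TS = (1/m) Σ_k (1 − p_(k) · (m + 1) / k), where p_(1) ≤ … ≤ p_(m) are the ordered p-values. Treat that formula, and the function name `tail_strength`, as assumptions of this example rather than something confirmed by the listing.

```python
def tail_strength(p_values):
    """Tail strength of a collection of p-values (assumed definition).

    Each ordered p-value p_(k) is compared with k / (m + 1), its expected
    value when all null hypotheses hold and the tests are independent.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    ordered = sorted(p_values)
    total = sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1))
    return total / m
```

Under this definition, p-values that sit exactly at their null expectations give a tail strength of about zero, while uniformly small p-values push the measure towards one.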


Journal Article
TL;DR: Dynamic computation of interaction networks and mapping to existing pathways knowledge databases revealed a potential role of EGR1 in p21-induced cell cycle arrest and intrinsic chemotherapy resistance of mature teratomas.
Abstract: Germ cell tumors (GCTs) of the testis are the predominant cancer among young men. We analyzed gene expression profiles of 50 GCTs of various subtypes, and we compared them with 443 other common malignant tumors of epithelial, mesenchymal, and lymphoid origins. Significant differences in gene expression were found among major histological subtypes of GCTs, and between them and other malignancies. We identified 511 genes, belonging to several critical functional groups such as cell cycle progression, cell proliferation, and apoptosis, to be significantly differentially expressed in GCTs compared with other tumor types. Sixty-five genes were sufficient for the construction of a GCT class predictor of high predictive accuracy (100% training set, 96% test set), which might be useful in the diagnosis of tumors of unknown primary origin. Previously described diagnostic and prognostic markers were found to be expressed by the appropriate GCT subtype (AFP, POU5F1, POV1, CCND2, and KIT). Several additional differentially expressed genes were identified in teratomas (EGR1 and MMP7), yolk sac tumors (PTPN13 and FN1), and seminomas (NR6A1, DPPA4, and IRX1). Dynamic computation of interaction networks and mapping to existing pathways knowledge databases revealed a potential role of EGR1 in p21-induced cell cycle arrest and intrinsic chemotherapy resistance of mature teratomas.

80 citations


Journal Article
TL;DR: These studies significantly focus efforts aimed at identifying central gene regulatory pathways that mediate atherosclerotic disease, and the identification of classification gene sets offers unique insights into potential diagnostic and therapeutic strategies in atherosclerosis.
Abstract: The propensity for developing atherosclerosis is dependent on underlying genetic risk and varies as a function of age and exposure to environmental risk factors. Employing three mouse models with different disease susceptibility, two diets, and a longitudinal experimental design, it was possible to manipulate each of these factors to focus analysis on genes most likely to have a specific disease-related function. To identify differences in longitudinal gene expression patterns of atherosclerosis, we have developed and employed a statistical algorithm that relies on generalized regression and permutation analysis. Comprehensive annotation of the array with ontology and pathway terms has allowed rigorous identification of molecular and biological processes that underlie disease pathophysiology. The repertoire of atherosclerosis-related immunomodulatory genes has been extended, and additional fundamental pathways have been identified. This highly disease-specific group of mouse genes was combined with an extensive human coronary artery data set to identify a shared group of genes differentially regulated among atherosclerotic tissues from different species and different vascular beds. A small core subset of these differentially regulated genes was sufficient to accurately classify various stages of the disease in mouse. The same gene subset was also found to accurately classify human coronary lesion severity. In addition, this classifier gene set was able to distinguish with high accuracy atherectomy specimens from native coronary artery disease vs. those collected from in-stent restenosis lesions, thus identifying molecular differences between these two processes. These studies significantly focus efforts aimed at identifying central gene regulatory pathways that mediate atherosclerotic disease, and the identification of classification gene sets offers unique insights into potential diagnostic and therapeutic strategies in atherosclerotic disease.

Journal Article
TL;DR: The aim of this study was to characterize gene expression and DNA copy number profiles in androgen sensitive and androgen insensitive prostate cancer cell lines on a genome‐wide scale.
Abstract: BACKGROUND: The aim of this study was to characterize gene expression and DNA copy number profiles in androgen sensitive (AS) and androgen insensitive (AI) prostate cancer cell lines on a genome-wide scale. METHODS: Gene expression profiles and DNA copy number changes were examined using DNA microarrays in eight commonly used prostate cancer cell lines. Chromosomal regions with DNA copy number changes were identified using cluster along chromosome (CLAC). RESULTS: There were discrete differences in gene expression patterns between AS and AI cells that were not limited to androgen-responsive genes. AI cells displayed more DNA copy number changes, especially amplifications, than AS cells. The gene expression profiles of cell lines showed limited similarities to prostate tumors harvested at surgery. CONCLUSIONS: AS and AI cell lines are different in their transcriptional programs and degree of DNA copy number alterations. This dataset provides a context for the use of prostate cancer cell lines as models for clinical cancers. © 2004 Wiley-Liss, Inc.

Journal Article
TL;DR: Multiple testing issues are important in gene expression studies. This paper discusses the 'miss rate', a measure complementary to the false discovery rate, and shows how to estimate it in practice.
Abstract: SUMMARY Multiple testing issues are important in gene expression studies, where typically thousands of genes are compared over two or more experimental conditions. The false discovery rate has become a popular measure in this setting. Here we discuss a complementary measure, the ‘miss rate’, and show how to estimate it in practice.

Reference Entry
15 Jul 2005

Book
01 Jan 2005
TL;DR: Two new methods, based on lasso, that can produce sparse, interpretable regression models that relate clusters of co-expressed genes to a quantitative phenotype are proposed, and a need for supervised clustering of genes is discussed, that is, the phenotype ought to have an influence on how genes are clustered.
Abstract: In the past decade, DNA and oligonucleotide microarray technology has been developed, allowing gene expression levels to be measured on a genome-wide scale. Use of this massive amount of molecular information appears to be promising for discovering genetic networks. Classification based on microarray experiments has been studied extensively. In comparison, microarray gene expression data has been analyzed less frequently in a regression set-up. From a statistical point of view, the challenge with analyzing microarray gene expression data is due to the very large number of genes, which far exceeds the sample size, i.e., the so-called “large p, small n” scenario. The lasso (least absolute shrinkage and selection operator) method is a promising regression method that incorporates automatic variable selection by imposing an L1 penalty on the regression coefficients. However the lasso method has its limitations in the “large p, small n” scenario. When p > n, the lasso method can select up to n variables before it saturates. And the lasso method does not offer a “grouped selection” effect. Therefore we propose two new methods, based on lasso, that are particularly suitable for microarray data regression analysis. The methods can produce sparse, interpretable regression models that relate clusters of co-expressed genes to a quantitative phenotype. Our methods are tested on simulated data sets as well as real microarray data sets. Besides the proposal of novel regression methods, we also propose quantitative definitions for evaluating the strength of the “grouped variable” effect in fitted regression models. The new definitions allow us to compare regression models quantitatively. We then discuss a need for supervised clustering of genes, that is, the phenotype ought to have an influence on how genes are clustered. 
One potential approach is to re-define the distances between pairs of genes by incorporating the phenotype into the definition of the new distance metric.
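
To illustrate how an L1 penalty produces the automatic variable selection mentioned in the abstract above, the sketch below applies soft-thresholding. This is the closed-form lasso solution in the special case of a single coefficient (equivalently, an orthonormal design), where the estimate minimises 0.5 · (z − b)² + λ|b|. The function name `soft_threshold`, the parameter `lam`, and the example coefficients are hypothetical, and this is not one of the two new methods the abstract proposes.

```python
def soft_threshold(z, lam):
    """Minimiser over b of 0.5 * (z - b)**2 + lam * abs(b), for lam >= 0."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    # Coefficients with |z| <= lam are set exactly to zero (dropped).
    return 0.0


# Illustrative unpenalised coefficients (made-up numbers).
unpenalised = [3.0, -0.4, 0.1, -2.5]
shrunk = [soft_threshold(z, 1.0) for z in unpenalised]
```

Here the two small coefficients are set exactly to zero and the two large ones are shrunk towards zero by `lam`, which is the sense in which the penalty both shrinks and selects.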

Journal Article
16 Nov 2005-Blood
TL;DR: The findings suggest that the differing gene expression profiles (GEPs) of MDS and normal CD34+ cells were not due to major differences in their proportions of CD38 cell subsets, and they provide molecular criteria that refine the prognostic categorization of MDS.


01 Jan 2005
TL;DR: In this paper, the authors propose an overall measure of significance for a set of hypothesis tests, which is a simple function of the p-values computed for each of the tests.
Abstract: We propose an overall measure of significance for a set of hypothesis tests. The tail strength is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.