
Showing papers by "Robert Tibshirani" published in 2003


Journal ArticleDOI
TL;DR: This work proposes an approach to measuring statistical significance in genomewide studies based on the concept of the false discovery rate, which offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted.
Abstract: With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

9,239 citations
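
As a rough illustration of how a q value can be attached to each tested feature, the sketch below turns a list of p values into q values. It assumes a single fixed tuning value `lam` for estimating the proportion of truly null features (`pi0`), which is a simplification. The function name `q_values` and its parameters are illustrative and are not taken from the paper or from any released software.

```python
def q_values(p_values, lam=0.5):
    """Return one q value per p value (same order as the input).

    Simplified sketch: pi0 is estimated from the fraction of p values
    above a single fixed cutoff `lam` (an assumption made here for brevity).
    """
    m = len(p_values)
    if m == 0:
        return []

    # Estimated proportion of truly null features, capped at 1.
    pi0 = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    pi0 = min(1.0, pi0)
    if pi0 <= 0.0:
        # No p values above `lam`: this simple estimate is unusable.
        raise ValueError("pi0 estimate is zero; try a smaller lam")

    # Indices of the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that q values are monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pi0 * m * p_values[idx] / rank
        running_min = min(running_min, candidate)
        q[idx] = running_min
    return q
```

Calling every feature with a q value at or below 0.05 significant then targets a false discovery rate of roughly 5% among the called features, rather than a 5% false positive rate among all truly null features.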


Journal ArticleDOI
TL;DR: The results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.
Abstract: Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.

5,281 citations


Proceedings Article
09 Dec 2003
TL;DR: It is argued that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features, and an efficient algorithm is proposed that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM.
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM.

962 citations
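
To make the contrast between the two penalties concrete, the sketch below evaluates a hinge-loss criterion with either a 1-norm or a squared 2-norm penalty on the coefficients. It is not the solution-path algorithm from the paper; the function names, the linear decision function, and the scaling of the 2-norm penalty (`lam` times the sum of squares) are assumptions made for illustration.

```python
def hinge_loss(beta0, beta, X, y):
    """Sum of hinge losses max(0, 1 - y_i * f(x_i)) for a linear f."""
    total = 0.0
    for row, label in zip(X, y):
        margin = label * (beta0 + sum(b * v for b, v in zip(beta, row)))
        total += max(0.0, 1.0 - margin)
    return total


def svm_objective(beta0, beta, X, y, lam, norm=1):
    """Hinge loss plus lam times a penalty on the coefficient vector.

    norm=1 uses the sum of absolute coefficients (1-norm SVM);
    norm=2 uses the sum of squared coefficients (standard 2-norm SVM).
    Labels in y are assumed to be +1 or -1.
    """
    if norm == 1:
        penalty = sum(abs(b) for b in beta)
    elif norm == 2:
        penalty = sum(b * b for b in beta)
    else:
        raise ValueError("norm must be 1 or 2")
    return hinge_loss(beta0, beta, X, y) + lam * penalty
```

The 1-norm penalty grows linearly in each coefficient, so minimizers tend to set some coefficients exactly to zero, which is one way to read the claimed advantage when many features are redundant noise.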


Journal ArticleDOI
TL;DR: This work proposes a new method for class prediction in DNA microarray studies based on an enhancement of the nearest prototype classifier that uses "shrunken" centroids as prototypes for each class to identify the subsets of the genes that best characterize each class.
Abstract: We propose a new method for class prediction in DNA microarray studies based on an enhancement of the nearest prototype classifier. Our technique uses "shrunken" centroids as prototypes for each class to identify the subsets of the genes that best characterize each class. The method is general and can be applied to other high-dimensional classification problems. The method is illustrated on data from two gene expression studies: lymphoma and cancer cell lines.

469 citations
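
The idea of shrunken centroids can be sketched as follows: each class centroid is pulled toward the overall centroid by soft thresholding, so that genes whose class means barely differ from the overall mean drop out of the classifier. The sketch below is deliberately simplified and omits the per-gene standardization of the differences; the function names and the threshold parameter `delta` are illustrative assumptions.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta become zero."""
    if d > delta:
        return d - delta
    if d < -delta:
        return d + delta
    return 0.0


def shrunken_centroids(X, y, delta):
    """Return {label: shrunken centroid} for samples X (rows) with labels y."""
    n_features = len(X[0])
    overall = [sum(row[j] for row in X) / len(X) for j in range(n_features)]
    centroids = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_mean = [sum(r[j] for r in rows) / len(rows)
                      for j in range(n_features)]
        # Shrink each class-minus-overall difference, then add it back.
        centroids[label] = [
            overall[j] + soft_threshold(class_mean[j] - overall[j], delta)
            for j in range(n_features)
        ]
    return centroids


def predict(x, centroids):
    """Assign x to the class with the nearest (squared Euclidean) centroid."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2
                              for a, b in zip(x, centroids[label])),
    )
```

Features whose shrunken centroids coincide across all classes contribute the same amount to every distance and so no longer influence the prediction, which is how the subset of characterizing genes emerges.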


Journal ArticleDOI
TL;DR: Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers did not show a similar pattern of coexpression in the ovarian cancers.
Abstract: We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.

321 citations



Journal ArticleDOI
TL;DR: Analysis of sequential biopsies from patients with recurrent disease suggests that the presence of prominent extranodular L&H cells might represent early evolution to a diffuse (TCRBCL-like) pattern of NLPHL.
Abstract: Nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) has traditionally been recognized as having two morphologic patterns, nodular and diffuse, and the current WHO definition of NLPHL requires at least a partial nodular pattern. Variant patterns have not been well documented. We analyzed retrospectively the morphologic and immunophenotypic patterns of NLPHL from 118 patients (total of 137 biopsy samples). Histology plus antibodies directed against CD20, CD3, and CD21 were used to evaluate the immunoarchitecture. We identified six distinct immunoarchitectural patterns in our cases of NLPHL: "classic" (B-cell-rich) nodular, serpiginous/interconnected nodular, nodular with prominent extranodular L&H cells, T-cell-rich nodular, diffuse with a T-cell-rich background (T-cell-rich B-cell lymphoma [TCRBCL]-like), and a (diffuse) B-cell-rich pattern. Small germinal centers within neoplastic nodules were found in approximately 15% of cases, a finding not previously emphasized in NLPHL. Prominent sclerosis was identified in approximately 20% of cases and was frequently seen in recurrent disease. Clinical follow-up was obtained on 56 patients, including 26 patients who had not had recurrence of disease and 30 patients who had recurrence. The follow-up period was 5 months to 16 years (median 2.5 years). The presence of a diffuse (TCRBCL-like) pattern was significantly more common in patients with recurrent disease than those without recurrence. Furthermore, the presence of a diffuse pattern (TCRBCL-like) was shown to be an independent predictor of recurrent disease (P = 0.00324). In addition, there is a tendency for progression to an increasingly more diffuse pattern over time. Analysis of sequential biopsies from patients with recurrent disease suggests that the presence of prominent extranodular L&H cells might represent early evolution to a diffuse (TCRBCL-like) pattern. 
We also report three patients who presented initially with diffuse large B-cell lymphoma and later developed NLPHL.

258 citations


01 Jan 2003
TL;DR: This work associates a measure of statistical significance called the q-value with each tested feature, in addition to the traditional p-value; the approach avoids a flood of false positive results while offering a more liberal criterion than what has been used in genome scans for linkage.
Abstract: With the increase in genome-wide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genome-wide data set are tested against some null hypothesis, where many features are expected to be significant. Here we propose an approach to statistical significance in the analysis of genome-wide data sets, based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true findings and the number of false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q-value is associated with each tested feature in addition to the traditional p-value. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

201 citations


Book ChapterDOI
01 Jan 2003
TL;DR: This chapter explains how the SAM methodology works in the context of a general approach to detecting differential gene expression in DNA microarrays, and summarizes recently developed methodology for estimating false discovery rates and q-values that has been included in the software.
Abstract: SAM is a computer package for correlating gene expression with an outcome parameter such as treatment, survival time, or diagnostic class. It thresholds an appropriate test statistic and reports the q-value of each test based on a set of sample permutations. SAM works as a Microsoft Excel add-in and has additional features for fold-change thresholding and block permutations. Here, we explain how the SAM methodology works in the context of a general approach to detecting differential gene expression in DNA microarrays. Some recently developed methodology for estimating false discovery rates and q-values has been included in the SAM software, which we summarize here.

165 citations
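
The combination of a thresholded test statistic and sample permutations can be illustrated with a small sketch. It is not the SAM software or its exact procedure: it uses a two-class statistic with a small constant `s0` added to the denominator, enumerates every relabelling of the samples (feasible only for tiny examples), and estimates the false discovery rate at one fixed threshold. All function and parameter names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def relative_difference(a, b, s0):
    """(mean(a) - mean(b)) / (pooled standard error + s0), with s0 > 0."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    s = ((1.0 / na + 1.0 / nb) * ss / (na + nb - 2)) ** 0.5
    return (ma - mb) / (s + s0)


def permutation_fdr(genes, labels, threshold, s0=0.1):
    """Estimate the FDR for calling genes with |statistic| >= threshold.

    genes:  list of per-gene expression lists (one value per sample)
    labels: list of 0/1 class labels, one per sample
    Returns (number called, expected false calls, estimated FDR).
    """
    n = len(labels)
    n1 = sum(labels)

    def count_called(group1):
        count = 0
        for gene in genes:
            a = [gene[j] for j in range(n) if j in group1]
            b = [gene[j] for j in range(n) if j not in group1]
            if abs(relative_difference(a, b, s0)) >= threshold:
                count += 1
        return count

    observed_group = {j for j in range(n) if labels[j] == 1}
    called = count_called(observed_group)

    # Average number of calls over every way of relabelling the samples.
    perm_counts = [count_called(set(g))
                   for g in combinations(range(n), n1)]
    expected_false = mean(perm_counts)

    fdr = min(1.0, expected_false / called) if called else 0.0
    return called, expected_false, fdr
```

In practice a random subset of permutations would replace the full enumeration, and a q-value for each gene would be obtained by taking the smallest estimated false discovery rate over all thresholds at which that gene is called.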


Journal ArticleDOI
TL;DR: A case-crossover analysis of the protective effect of recent convictions on individual drivers found that the benefit was greater for speeding violations with penalty points than for speeding violations without points, was no different for crashes of differing severity, and was not seen in drivers whose licences were suspended.

149 citations


Journal ArticleDOI
15 Jan 2003-Blood
TL;DR: Patients with DLBCL expressing high levels of HGAL mRNA demonstrate significantly longer overall survival than do patients with low HGAL expression, independent of the clinical international prognostic index.

Book ChapterDOI
26 Mar 2003
TL;DR: The chapter opens with a running example based on data from Sorlie et al. (2001), whose goal was to classify breast carcinomas based on variations in gene expression derived from complementary deoxyribonucleic acid (cDNA) microarrays and to correlate tumor characteristics to clinical outcome.
Abstract: We begin with an example that will be used throughout the chapter. The data come from Sorlie et al. (2001). The goal of that article was to "classify breast carcinomas based on variations in gene expression derived from complementary deoxyribonucleic acid (cDNA) microarrays and to correlate tumor characteristics to clinical outcome." The data consist of log fluorescence values for 456 cDNA clones measured on 85 tissue samples. Of the 85 samples, 4 are normal tissue samples, 78 are carcinomas, and 3 are fibroadenomas. Three of the four normal tissue samples were pooled normal breast samples from multiple individuals. Sorlie et al. (2001) selected the 456 genes from an initial set of 8102 genes so as to optimally identify the intrinsic characteristics of breast tumors. In Figures 4.1 and 4.2, the data are plotted as heat maps. This representation assigns a color for every matrix entry, with negative (underexpressed) values being green, and positive (overexpressed) values red. The data presented in this plot were preprocessed by Sorlie et al. (2001), adjusting rows and columns to have median zero. This preprocessing was applied before selection of the subset of 456 genes, so the column (i.e., sample) medians are not precisely zero.
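
The remark about column medians can be illustrated with a small sketch. It assumes the centering is done by alternately subtracting row and column medians for a fixed number of passes, which is one plausible reading of "adjusting rows and columns to have median zero" and not necessarily the exact procedure of Sorlie et al. (2001). The function names are hypothetical.

```python
from statistics import median


def center_rows_and_columns(matrix, n_iter=10):
    """Alternately subtract row medians and column medians (returns a copy)."""
    data = [list(row) for row in matrix]
    for _ in range(n_iter):
        for row in data:
            m = median(row)
            for j in range(len(row)):
                row[j] -= m
        for j in range(len(data[0])):
            m = median(row[j] for row in data)
            for row in data:
                row[j] -= m
    return data


def column_medians(matrix):
    """Median of each column of a list-of-rows matrix."""
    return [median(row[j] for row in matrix) for j in range(len(matrix[0]))]


# Rows play the role of genes, columns the role of samples.
centered = center_rows_and_columns([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
full = column_medians(centered)        # all zero on the full matrix
subset = column_medians(centered[2:])  # keep only a subset of the rows
```

On the full centered matrix every column median is zero, but once only a subset of rows (genes) is kept, the column (sample) medians are recomputed over fewer entries and need no longer be zero.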

01 Jan 2003
TL;DR: This discussion notes that the results of Jiang, Lugosi and Vayatis, and Zhang imply that boosting-like methods can reasonably be expected to converge to Bayes classifiers under sufficient regularity conditions (such as the requirement that trees with at least p + 1 terminal nodes are used, where p is the number of variables in the model).
Abstract: We congratulate the authors for their interesting papers on boosting and related topics. Jiang deals with the asymptotic consistency of AdaBoost. Lugosi and Vayatis study the convex optimization of loss functions associated with boosting. Zhang studies the loss functions themselves. Their results imply that boosting-like methods can reasonably be expected to converge to Bayes classifiers under sufficient regularity conditions (such as the requirement that trees with at least p + 1 terminal nodes are used, where p is the number of variables in the model). An interesting feature of their results is that whenever data-based optimization is performed, some form of regularization is needed in order to attain consistency. In the case of AdaBoost this is achieved by stopping the boosting procedure early, whereas in the case of convex loss optimization, it is achieved by constraining the L1 norm of the coefficient vector. These results re-iterate, from this new perspective, the critical importance of regularization for building useful prediction models in high-dimensional space. This is also the theme of the remainder of our discussion. Since the publication of the AdaBoost procedure by Freund and Schapire in 1996, there has been a flurry of papers seeking to answer the question: why does boosting work? Since AdaBoost has been generalized in different ways by different authors, the question might be better posed as: what are the aspects of boosting that are the key to its good performance?

Journal ArticleDOI
TL;DR: A novel procedure called "nearest shrunken centroids" is described that has successfully detected clinically relevant genetic differences in cancer patients and has the potential to become a powerful tool for diagnosing and treating cancer.
Abstract: The morbidity rate of cancer victims varies greatly for similar patients who receive similar treatments. It is hypothesized that this variation can be explained by the genetic heterogeneity of the disease. DNA microarrays, which can simultaneously measure the expression level of thousands of different genes, have been successfully used to identify such genetic differences. However, microarray data typically have a large number of features and relatively few observations, meaning that conventional machine learning tools can fail when applied to such data. We describe a novel procedure called "nearest shrunken centroids" that has successfully detected clinically relevant genetic differences in cancer patients. This procedure has the potential to become a powerful tool for diagnosing and treating cancer.

Journal ArticleDOI
TL;DR: While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in the book Elements of Statistical Learning (2001); Cherkassky and Ma feel that the structural risk minimization (SRM) method is falsely represented there.
Abstract: While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation, errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212). These and other corrections can be seen in the errata section on-line at http://www-stat.stanford.edu/ElemStatLearn.


Patent
14 Oct 2003
TL;DR: An expression profile for the transcriptional response to a therapy is obtained from the patient and compared to a reference profile to determine whether the patient is susceptible to toxicity.
Abstract: Methods are provided for determining whether a patient treated with an anti-proliferative agent is susceptible to toxicity. In practicing the subject methods, an expression profile for the transcriptional response to a therapy is obtained from the patient and compared to a reference profile to determine whether the patient is susceptible to toxicity. In addition, reagents and kits thereof that find use in practicing the subject methods are provided.