
Showing papers by "Robert Tibshirani published in 2001"



Journal ArticleDOI
TL;DR: A method is described that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements; applied to the radiation response of human cells, it revealed induction of nucleotide excision repair genes, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
Abstract: Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.
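The scoring rule and permutation-based FDR estimate described in the abstract can be sketched in a few lines of NumPy. This is a toy illustration, not the published SAM implementation: the fudge factor `s0`, the two-class setup, and the function names are all assumptions.

```python
import numpy as np

def sam_scores(x, y, s0=0.1, n_perm=100, seed=0):
    """Score each gene (row of x) by the difference in group means relative
    to the gene-wise standard error plus a small fudge factor s0, then build
    a permutation null distribution by shuffling the sample labels y."""
    rng = np.random.default_rng(seed)

    def scores(labels):
        a, b = x[:, labels == 0], x[:, labels == 1]
        se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                     + b.var(axis=1, ddof=1) / b.shape[1])
        return (b.mean(axis=1) - a.mean(axis=1)) / (se + s0)

    d = scores(y)
    null = np.array([scores(rng.permutation(y)) for _ in range(n_perm)])
    return d, null

def fdr_at(d, null, thresh):
    """Estimated FDR at a threshold: average number of null scores exceeding
    it (genes expected by chance) divided by the number of genes called."""
    called = np.sum(np.abs(d) >= thresh)
    expected_false = np.mean(np.sum(np.abs(null) >= thresh, axis=1))
    return expected_false / max(called, 1)
```

Raising the threshold trades fewer called genes for a lower estimated FDR, which is the tuning knob the abstract describes.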

12,102 citations


Journal ArticleDOI
TL;DR: Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.
Abstract: The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.

10,791 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method called the "gap statistic" for estimating the number of clusters (groups) in a set of data, which uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution.
Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
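The comparison the abstract describes can be sketched directly: compute the within-cluster dispersion W_k for each k, and compare log(W_k) with its average under uniform reference data drawn over the range of the observations. Below is a minimal NumPy illustration with a naive restarted k-means; the restart count, reference sample size, and function names are assumptions, and the paper's standard-error rule for picking k is omitted.

```python
import numpy as np

def kmeans(x, k, n_iter=25, n_init=5, seed=0):
    """Naive Lloyd's k-means with random restarts; returns cluster labels."""
    rng = np.random.default_rng(seed)
    best, best_w = None, np.inf
    for _ in range(n_init):
        centers = x[rng.choice(len(x), k, replace=False)].copy()
        for _ in range(n_iter):
            labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = x[labels == j].mean(axis=0)
        w = ((x - centers[labels]) ** 2).sum()
        if w < best_w:
            best, best_w = labels, w
    return best

def log_wk(x, labels):
    """Log within-cluster dispersion: summed squared distances to cluster means."""
    w = 0.0
    for j in np.unique(labels):
        pts = x[labels == j]
        w += ((pts - pts.mean(axis=0)) ** 2).sum()
    return np.log(w)

def gap_statistic(x, k_max=5, n_ref=10, seed=0):
    """Gap(k) = mean log dispersion under uniform reference data - observed
    log dispersion; a large gap suggests k captures real cluster structure."""
    rng = np.random.default_rng(seed)
    lo, hi = x.min(axis=0), x.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        obs = log_wk(x, kmeans(x, k))
        ref = [log_wk(r, kmeans(r, k))
               for r in (rng.uniform(lo, hi, size=x.shape) for _ in range(n_ref))]
        gaps.append(np.mean(ref) - obs)
    return np.array(gaps)
```

For data with genuine groups, the gap rises sharply at the true number of clusters, because the observed dispersion drops faster than the reference dispersion does.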

4,283 citations


Journal ArticleDOI
TL;DR: It is shown that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros).
Abstract: Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1–20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions. Availability: The software is available at http://smi-web.
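A weighted K-nearest-neighbour imputation of the kind the paper evaluates can be sketched as follows. This is a simplified illustration, not the KNNimpute software: it searches only among fully observed genes, and the inverse-distance weighting and names are assumptions.

```python
import numpy as np

def knn_impute(x, k=3):
    """Fill NaNs in a genes-by-arrays matrix: for each gene with missing
    values, find the k fully observed genes nearest in Euclidean distance
    over the gene's observed columns, and fill each missing cell with an
    inverse-distance-weighted average of those neighbours' values."""
    x = x.copy()
    has_nan = np.isnan(x).any(axis=1)
    complete = x[~has_nan]           # candidate neighbours (no missing values)
    for i in np.where(has_nan)[0]:
        obs = ~np.isnan(x[i])
        d = np.sqrt(((complete[:, obs] - x[i, obs]) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        w = 1.0 / (d[nearest] + 1e-12)   # closer neighbours get more weight
        for j in np.where(~obs)[0]:
            x[i, j] = np.average(complete[nearest, j], weights=w)
    return x
```

The paper's finding that this beats row averaging is intuitive from the sketch: the neighbours are chosen per gene, so the fill-in reflects genes with similar expression profiles rather than the gene's own overall mean.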

3,542 citations



Journal ArticleDOI
TL;DR: A simple nonparametric empirical Bayes model is introduced, which is used to guide the efficient reduction of the data to a single summary statistic per gene, and also to make simultaneous inferences concerning which genes were affected by the radiation.
Abstract: Microarrays are a novel technology that facilitates the simultaneous measurement of thousands of gene expression levels. A typical microarray experiment can produce millions of data points, raising serious problems of data reduction and simultaneous inference. We consider one such experiment in which oligonucleotide arrays were employed to assess the genetic effects of ionizing radiation on seven thousand human genes. A simple nonparametric empirical Bayes model is introduced, which is used to guide the efficient reduction of the data to a single summary statistic per gene, and also to make simultaneous inferences concerning which genes were affected by the radiation. Although our focus is on one specific experiment, the proposed methods can be applied quite generally. The empirical Bayes inferences are closely related to the frequentist false discovery rate (FDR) criterion.

1,868 citations


Journal ArticleDOI
15 Aug 2001-Blood
TL;DR: High BCL-6 mRNA expression should be considered a new favorable prognostic factor in DLBCL and should be used in the stratification and the design of risk-adjusted therapies for patients withDLBCL.

286 citations


01 Jan 2001
TL;DR: The singular value decomposition offers an interesting and stable method for imputation of missing values in gene expression arrays by regressing its non-missing entries on the eigen-genes and using the regression function to predict the expression values at the missing locations.
Abstract: The singular value decomposition offers an interesting and stable method for imputation of missing values in gene expression arrays. The basic paradigm is:
• Learn a set of basis functions, or eigen-genes, from the complete data.
• Impute the missing cells for a gene by regressing its non-missing entries on the eigen-genes, and use the regression function to predict the expression values at the missing locations.
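The two-step paradigm above can be sketched as an iterative procedure in NumPy. This is a simplified illustration of the idea, not the authors' implementation: starting from row means, the number of eigen-genes, and the iteration count are assumptions.

```python
import numpy as np

def svd_impute(x, n_eigen=2, n_iter=20):
    """Iterative SVD imputation: fill missing cells with row means, take the
    top right-singular vectors (eigen-genes) of the filled matrix, then
    re-estimate each missing cell by regressing the gene's observed entries
    on the eigen-genes and predicting at the missing locations."""
    x = x.copy()
    miss = np.isnan(x)
    row_means = np.nanmean(x, axis=1)
    x[miss] = np.take(row_means, np.where(miss)[0])   # crude starting fill
    for _ in range(n_iter):
        _, _, vt = np.linalg.svd(x, full_matrices=False)
        v = vt[:n_eigen].T                            # (arrays, n_eigen) eigen-genes
        for i in np.where(miss.any(axis=1))[0]:
            obs = ~miss[i]
            # least-squares fit of the gene's observed entries on the eigen-genes
            beta, *_ = np.linalg.lstsq(v[obs], x[i, obs], rcond=None)
            x[i, ~obs] = v[~obs] @ beta               # predict at missing locations
    return x
```

Each pass tightens the fit: better fill-ins yield better eigen-genes, which in turn yield better regressions, so on approximately low-rank data the imputed values settle quickly.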

235 citations


Journal ArticleDOI
TL;DR: It is found that the procedure may require a large number of experimental samples to successfully discover interactions, and is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
Abstract: We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.

229 citations


Book ChapterDOI
01 Jan 2001
TL;DR: The first three examples described in Chapter 1 have several components in common: for each there is a set of variables that might be denoted as inputs, which are measured or preset.
Abstract: The first three examples described in Chapter 1 have several components in common. For each there is a set of variables that might be denoted as inputs, which are measured or preset. These have some influence on one or more outputs. For each example the goal is to use the inputs to predict the values of the outputs. This exercise is called supervised learning.

Patent
19 Mar 2001
TL;DR: Significance analysis of microarrays (SAM) as mentioned in this paper assigns a score to each gene based on the change in gene expression relative to the standard deviation of repeated measurements, and uses permutations of the repeated measurements to estimate the percentage of such genes identified by chance, the false discovery rate.
Abstract: Microarrays can measure the expression of thousands of genes and thus identify changes in expression between different biological states. Methods are needed to determine the significance of these changes, while accounting for the enormous number of genes. We describe a new method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene based on the change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of such genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared to FDRs of 60% and 84% using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation, and 3 in apoptosis. Surprisingly, 4 nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a heretofore unrecognized role in repairing DNA damaged by ionizing radiation.

Book ChapterDOI
01 Jan 2001
TL;DR: A linear regression model assumes that the regression function E(Y|X) is linear in the inputs; linear methods can also be applied to transformations of the inputs, a generalization known as basis-function methods.
Abstract: A linear regression model assumes that the regression function E(Y|X) is linear in the inputs X 1,..., X p . Linear models were largely developed in the precomputer age of statistics, but even in today’s computer era there are still good reasons to study and use them. They are simple and often provide an adequate and interpretable description of how the inputs affect the output. For prediction purposes they can sometimes outperform fancier nonlinear models, especially in situations with small numbers of training cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be applied to transformations of the inputs and this considerably expands their scope. These generalizations are sometimes called basis-function methods, and are discussed in Chapter 5.

Book ChapterDOI
01 Jan 2001
TL;DR: This article is reproduced from the previous edition, volume 19, pp. 13216–13220, with an updated Bibliography section supplied by the Editor.
Abstract: This article is reproduced from the previous edition, volume 19, pp. 13216–13220, © 2001, Elsevier Ltd, with an updated Bibliography section supplied by the Editor.

Book ChapterDOI
01 Jan 2001
TL;DR: This chapter revisits the classification problem and focuses on linear methods for classification, in which the decision boundaries between the classification regions of the input space are linear.
Abstract: In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.

Journal Article
TL;DR: On February 13, 1997, the lead article in a widely cited medical journal was published, in which the authors reported an association between cellular telephone calls and motor vehicle collisions.
Abstract: On February 13, 1997, we published the lead article in a widely cited medical journal, in which we reported an association between cellular telephone calls and motor vehicle collisions.[1] During that week we participated in more than 50 media interviews because we think scientists have an

Book ChapterDOI
01 Jan 2001
TL;DR: In this paper, it is shown that the true function f(X) = E(Y|X) will typically be nonlinear and nonadditive in X, and representation by a linear model is usually a convenient, and sometimes a necessary, approximation.
Abstract: We have already made use of models linear in the input features, both for regression and classification. Linear regression, linear discriminant analysis, logistic regression and separating hyperplanes all rely on a linear model. It is extremely unlikely that the true function f(X) is actually linear in X. In regression problems, f(X) = E(Y|X) will typically be nonlinear and nonadditive in X, and representing f(X) by a linear model is usually a convenient, and sometimes a necessary, approximation. Convenient because a linear model is easy to interpret, and is the first-order Taylor approximation to f(X). Sometimes necessary, because with N small and/or p large, a linear model might be all we are able to fit to the data without overfitting. Likewise in classification, a linear, Bayes-optimal decision boundary implies that some monotone transformation of Pr(Y = 1|X) is linear in X. This is inevitably an approximation.

Journal ArticleDOI
TL;DR: In this paper, the authors developed a new technique to estimate the integral of the distribution of T2 relaxation time without imposing any constraint other than the monotonicity of the underlying cumulative relaxation time distribution.
Abstract: Magnetic resonance imaging techniques can be used to measure some biophysical properties of tissue. In this context, the T2 relaxation time is an important parameter for soft-tissue contrast. The authors develop a new technique to estimate the integral of the distribution of T2 relaxation time without imposing any constraint other than the monotonicity of the underlying cumulative relaxation time distribution. They explore the properties of the estimator and its applications for the analysis of breast tissue data. As they show, an extension of linear discriminant analysis is found to distinguish well between two classes of breast tissue.