
Showing papers by "Robert Tibshirani" published in 2000


Journal ArticleDOI
03 Feb 2000-Nature
TL;DR: It is shown that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour.
Abstract: Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

9,493 citations


Journal ArticleDOI
TL;DR: This work shows that this seemingly mysterious phenomenon of boosting can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood, and develops more direct approximations and shows that they exhibit nearly identical results to boosting.
Abstract: Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.

6,598 citations
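
The boosting abstract above describes the procedure in words: fit a classifier to reweighted training data, repeat, and take a weighted majority vote. A minimal sketch of that loop follows, assuming labels coded as -1/+1 and one-split decision stumps as the base classifier. Both choices, the function names, and the 0.5 factor in the vote weight are illustrative assumptions in the style of common textbook presentations of discrete AdaBoost, not details taken from the paper.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best one-split rule."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def stump_predict(X, j, t, s):
    """Predict s on one side of the threshold and -s on the other."""
    return np.where(X[:, j] <= t, s, -s)

def adaboost(X, y, rounds=50):
    """Discrete AdaBoost with stumps; y must contain only -1 and +1."""
    w = np.full(len(y), 1.0 / len(y))        # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err) # vote weight of this stump
        pred = stump_predict(X, j, t, s)
        w = w * np.exp(-alpha * y * pred)     # upweight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    """Weighted majority vote of all stumps in the ensemble."""
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

Calling `adaboost(X, y)` on a feature matrix `X` and a -1/+1 label vector `y` returns a list of weighted stumps, and `predict` applies the vote. The additive form of `score` is what the abstract's "additive modeling" interpretation refers to; the sketch does not implement the paper's likelihood-based variants.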


Book
01 Jan 2000
TL;DR: The gap statistic is proposed for estimating the number of clusters (groups) in a set of data by comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution.
Abstract: We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.

3,860 citations
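
The gap statistic abstract above compares within-cluster dispersion on the data with its expectation under a reference null distribution. The sketch below is one way to illustrate that comparison, under several assumptions not stated in the abstract: the reference distribution is uniform over the bounding box of the data, dispersion is the k-means within-cluster sum of squares, the selected k is the smallest one whose gap is within one standard error of the next, and the k-means routine uses a simple deterministic farthest-point initialisation. All function names are hypothetical.

```python
import numpy as np

def kmeans_wss(X, k, iters=50):
    """Within-cluster sum of squares from a basic Lloyd's k-means."""
    # Farthest-point initialisation: deterministic, chosen here for simplicity.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old centre if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.min(axis=1).sum()

def gap_statistic(X, k_max=5, n_refs=10, seed=0):
    """Return (chosen k, array of gap values for k = 1..k_max)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference data sets: uniform draws over the bounding box of X (assumption).
    refs = [rng.uniform(lo, hi, size=X.shape) for _ in range(n_refs)]
    gaps, sds = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_wss(X, k))
        ref_log_w = np.array([np.log(kmeans_wss(R, k)) for R in refs])
        gaps.append(ref_log_w.mean() - log_w)   # expected minus observed log-dispersion
        sds.append(ref_log_w.std() * np.sqrt(1 + 1 / n_refs))
    gaps, sds = np.array(gaps), np.array(sds)
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sds[k]:     # smallest k within one s.e. of the next
            return k, gaps
    return k_max, gaps
```

Because the abstract says the technique accepts the output of any clustering algorithm, `kmeans_wss` could be swapped for any routine that returns a within-cluster dispersion for a given k.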


Journal ArticleDOI
TL;DR: The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.
Abstract: Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

618 citations


Journal ArticleDOI
TL;DR: Side-by-side application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection.
Abstract: Analysis of somatic mutations in V regions of Ig genes is important for understanding various biological processes. It is customary to estimate Ag selection on Ig genes by assessment of replacement (R) as opposed to silent (S) mutations in the complementary-determining regions and S as opposed to R mutations in the framework regions. In the past such an evaluation was performed using a binomial distribution model equation, which is inappropriate for Ig genes in which mutations have four different distribution possibilities (R and S mutations in the complementary-determining region and/or framework regions of the gene). In the present work, we propose a multinomial distribution model for assessment of Ag selection. Side-by-side application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection. We suggest the use of the multinomial model for all future analysis of Ag selection.

191 citations


Journal ArticleDOI
TL;DR: In this paper, a general procedure for posterior sampling from additive and generalized additive models is proposed, which is a stochastic generalization of the well-known backfitting algorithm for fitting additive models.
Abstract: We propose general procedures for posterior sampling from additive and generalized additive models. The procedure is a stochastic generalization of the well-known backfitting algorithm for fitting additive models. One chooses a linear operator (“smoother”) for each predictor, and the algorithm requires only the application of the operator and its square root. The procedure is general and modular, and we describe its application to nonparametric, semiparametric and mixed models.

138 citations


Journal ArticleDOI
01 Mar 2000-Blood
TL;DR: The findings indicate that the etiology and the driving forces for clonal expansion are heterogeneous, which may explain the well-known clinical and pathologic heterogeneity of DLBCL.

137 citations





Journal ArticleDOI
01 Jun 2000-Chance
TL;DR: The authors ask whether people can accurately judge if they are in a lane that is slower than the next lane on a congested roadway, and show that the expectation of spending equal amounts of time passing and being overtaken is mistaken when time is measured by discrete intervals.
Abstract: Motor-vehicle travel is a mixed blessing for modern times. During the average day in the United States, for example, about 100 people step into a vehicle and do not emerge alive, according to data from the 1996 Statistical Abstract of the United States, published by the Bureau of the Census. Crashes are especially poignant if they kill healthy people who otherwise might have led long and productive lives. A 1957 New England Journal of Medicine study found that crashes are usually (>90%) attributed to driver error rather than failures in the vehicle or roadway. The most important factor in driver error is alcohol, contributing to about 40% of fatal collisions in 1994 according to a National Highway Traffic Safety Administration report. The other factors causing driver error are not completely known. A better understanding of such errors might allow people to benefit from motor vehicle travel at a lower personal risk. One potential driver error is an inappropriate lane change, a vehicle maneuver that may have substantial risks for several reasons. First, it causes the individual to straddle traffic flows and be exposed to two streams of vehicles. Second, it requires the driver to make rapid judgments about sufficient spacing. Third, it increases the hazard related to other vehicles approaching along the driver's blind spot. Fourth, it disrupts the traffic pattern for following vehicles. The overall risks associated with each lane change are uncertain because the amount of normal driving spent making lane changes is not known with precision; however, rough estimates in an Ontario Ministry of Transportation report suggest about a threefold relative risk if less than 1% of normal driving involves a lane change. We wondered whether people can accurately judge if they are in a lane that is slower than the next lane on a congested roadway.
Mistaken impressions, for example, might cause a driver to incorrectly think the next lane is faster and motivate a needless lane change. Perhaps errors in judgment produce a systematic bias and create an illusion that the next lane is generally moving faster, even if all lanes have the same average speed. One basis for such error is if drivers expect that they should spend equal amounts of time passing and being overtaken. We have shown that such an expectation is mistaken when time is measured by discrete intervals. We recently popularized this finding in an exceedingly short paper

Journal ArticleDOI
TL;DR: Risk factors of delayed extubation, prolonged intensive care unit (ICU) length of stay (LOS), and mortality had not been studied for patients administered fast-track cardiac anesthesia (FTCA); the authors set out to determine risk factors of outcomes and cardiac risk scores (CRS).
Abstract: Background: Risk factors of delayed extubation, prolonged intensive care unit (ICU) length of stay (LOS), and mortality have not been studied for patients administered fast-track cardiac anesthesia (FTCA). The authors’ goals were to determine risk factors of outcomes and cardiac risk scores (CRS) for