
Showing papers by "Robert Tibshirani published in 2012"


Journal ArticleDOI
TL;DR: This work proposes strong rules for discarding predictors in lasso regression and related problems that are very simple yet screen out far more predictors than the SAFE rules, and derives conditions under which the strong rules are foolproof.
Abstract: Summary. We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed ‘SAFE’ rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush–Kuhn–Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush–Kuhn–Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
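
As a concrete illustration of the screening idea described above, here is a minimal R sketch of the basic (global) strong rule, assuming standardized predictors and a centered outcome. The function name and simulated data are illustrative, and this is not the authors' implementation; the paper's sequential rule instead uses inner products with residuals from the previous point on the penalty path, and any discarded predictors should still be verified against the KKT conditions.

```r
# Basic strong rule (illustrative sketch): discard predictor j at penalty
# lambda when |x_j' y| < 2*lambda - lambda_max, where
# lambda_max = max_j |x_j' y| is the smallest penalty giving the all-zero fit.
strong_rule_keep <- function(x, y, lambda) {
  scores <- abs(crossprod(x, y))                 # univariate inner products |x_j' y|
  lambda_max <- max(scores)
  as.vector(scores >= 2 * lambda - lambda_max)   # TRUE = keep, FALSE = discard
}

set.seed(1)
x <- scale(matrix(rnorm(200 * 5000), 200, 5000))
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(200)
y <- y - mean(y)
lam_max <- max(abs(crossprod(x, y)))
keep <- strong_rule_keep(x, y, lambda = 0.9 * lam_max)  # a penalty near the top of the path
sum(keep)   # predictors surviving the screen; fit the lasso on these columns,
            # then check the KKT conditions for the discarded ones
```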

607 citations


Journal ArticleDOI
25 May 2012-PLOS ONE
TL;DR: It is observed that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread, and a biomarker profile including autoantibodies and cytokines that predicts the imminent onset of clinical arthritis is identified.
Abstract: Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1–12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity, with an increasing number of ACPA subtypes over time. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.

415 citations


Journal ArticleDOI
TL;DR: A precise characterization of the effect of the hierarchy constraint is given, an unbiased estimate of the degrees of freedom is derived, a bound on that estimate reveals the amount of fitting "saved" by the constraint, and it is proved that hierarchy holds with probability one.
Abstract: We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. We distinguish between parameter sparsity - the number of nonzero coefficients - and practical sparsity - the number of raw variables one must measure to make a new prediction. Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.
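
The abstract points to the R package hierNet; the following is a hedged usage sketch on simulated data. The calls (hierNet.path, hierNet.cv, hierNet's path-fitting interface) follow the package's documented interface as recalled here, but argument names and defaults may differ across versions, so consult the current manual.

```r
# Usage sketch for hierNet: sparse interaction model under the hierarchy
# restriction. Data and penalty grid are illustrative only.
library(hierNet)

set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
# True model obeys hierarchy: main effects for x1, x2 plus their interaction.
y <- x[, 1] + x[, 2] + 2 * x[, 1] * x[, 2] + rnorm(100)

path  <- hierNet.path(x, y)       # fit over a grid of penalty values
cvfit <- hierNet.cv(path, x, y)   # cross-validate to choose the penalty
print(cvfit)                      # CV curve and selected value of lambda
```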

355 citations


Journal ArticleDOI
TL;DR: This work uses a log-linear model with a new approach to normalization to derive a novel procedure to estimate the false discovery rate (FDR), and demonstrates that the method has potential advantages over existing methods that are based on a Poisson or negative binomial model.
Abstract: We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
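
As a simplified, hedged illustration of count-based testing with normalization (not the paper's actual procedure), the R sketch below fits a per-gene Poisson log-linear model with a library-size offset and uses Benjamini-Hochberg adjustment in place of the paper's permutation-based FDR estimate. All simulation settings are invented for the example.

```r
# Per-gene Poisson log-linear test with sequencing-depth offsets (illustrative
# stand-in for the paper's method; the paper uses a different normalization
# and a permutation-based FDR estimate).
set.seed(1)
n <- 20; p <- 500
depth <- runif(n, 0.5, 2) * 1e5                   # unequal total read counts
group <- rep(c(0, 1), each = n / 2)               # two-class outcome
mu <- matrix(50, p, n)
mu[1:25, group == 1] <- 150                       # 25 truly associated genes
counts <- matrix(rpois(p * n, t(t(mu) * depth / 1e5)), p, n)

pvals <- apply(counts, 1, function(g) {
  fit <- glm(g ~ group, family = poisson, offset = log(depth))
  summary(fit)$coefficients["group", "Pr(>|z|)"]
})
sum(p.adjust(pvals, "BH") < 0.05)                 # genes called at nominal FDR 5%
```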

325 citations


Journal ArticleDOI
TL;DR: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.
Abstract: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.

220 citations


Journal ArticleDOI
TL;DR: By probing yeast RNA structures at different temperatures, relative melting temperatures (Tm) are obtained for RNA structures in over 4000 transcripts and specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames and highlighted numerous candidate regulatory RNA motifs in 3' untranslated regions.

191 citations


Journal ArticleDOI
12 Jan 2012-Blood
TL;DR: The in situ vaccination strategy that combines local radiation, to enhance tumor immunogenicity, with intratumoral injection of a TLR9 agonist is explored in a second disease: mycosis fungoides.

189 citations


Journal ArticleDOI
TL;DR: This paper discusses a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories), and demonstrates the interpretative value of producing prototypes on the well-known USPS ZIP code digits data set and shows that as a classifier it performs reasonably well.
Abstract: Prototype methods seek a minimal subset of samples that can serve as a distillation or condensed view of a data set. As the size of modern data sets grows, being able to present a domain specialist with a short list of "representative" samples chosen from the data set is of increasing interpretative value. While much recent statistical research has been focused on producing sparse-in-the-variables methods, this paper aims at achieving sparsity in the samples. We discuss a method for selecting prototypes in the classification setting (in which the samples fall into known discrete categories). Our method of focus is derived from three basic properties that we believe a good prototype set should satisfy. This intuition is translated into a set cover optimization problem, which we solve approximately using standard approaches. While prototype selection is usually viewed as purely a means toward building an efficient classifier, in this paper we emphasize the inherent value of having a set of prototypical elements. That said, by using the nearest-neighbor rule on the set of prototypes, we can of course discuss our method as a classifier as well.
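
The set-cover formulation lends itself to a greedy approximation; below is a hedged R sketch of that idea, a simplification of the paper's optimization rather than its actual algorithm. The ball radius eps and the penalty for covering wrong-class points are illustrative choices.

```r
# Greedy set-cover heuristic for prototype selection (illustrative
# simplification): each candidate prototype "covers" same-class points within
# distance eps; wrong-class points inside the ball count against it.
greedy_prototypes <- function(x, y, eps, penalty = 1) {
  d <- as.matrix(dist(x))
  covered <- rep(FALSE, nrow(x))
  prototypes <- integer(0)
  repeat {
    gain <- sapply(seq_len(nrow(x)), function(j) {
      near <- d[, j] <= eps
      sum(near & !covered & y == y[j]) - penalty * sum(near & y != y[j])
    })
    if (max(gain) <= 0) break              # no candidate adds net coverage
    j <- which.max(gain)
    prototypes <- c(prototypes, j)
    covered <- covered | (d[, j] <= eps & y == y[j])
  }
  prototypes
}

set.seed(1)
x <- rbind(matrix(rnorm(100), 50, 2), matrix(rnorm(100, mean = 3), 50, 2))
y <- rep(c("a", "b"), each = 50)
greedy_prototypes(x, y, eps = 1.5)         # indices of the selected prototypes
```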

180 citations


Journal ArticleDOI
TL;DR: Physicians' warnings to patients who are potentially unfit to drive may contribute to a decrease in subsequent trauma from road crashes, yet they may also exacerbate mood disorders and compromise the doctor-patient relationship.
Abstract: Background: Physicians' warnings to patients who are potentially unfit to drive are a medical intervention intended to prevent trauma from motor vehicle crashes. We assessed the association between medical warnings and the risk of subsequent road crashes. Methods: We identified consecutive patients who received a medical warning in Ontario, Canada, between April 1, 2006, and December 31, 2009, from a physician who judged them to be potentially unfit to drive. We excluded patients who were younger than 18 years of age, who were not residents of Ontario, or who lacked valid health-card numbers under universal health insurance. We analyzed emergency department visits for road crashes during a baseline interval before the warning and a subsequent interval after the warning. Results: A total of 100,075 patients received a medical warning from a total of 6098 physicians. During the 3-year baseline interval, there were 1430 road crashes in which the patient was a driver and presented to the emergency department, as...

133 citations


Journal ArticleDOI
TL;DR: The efficacy of this method, the "standardized Group Lasso", over the usual group lasso is demonstrated on real and simulated data sets, and it is shown that it is intimately related to the uniformly most powerful invariant test for inclusion of a group.
Abstract: We re-examine the original Group Lasso paper of Yuan and Lin (2007). The form of penalty in that paper seems to be designed for problems with uncorrelated features, but the statistical community has adopted it for general problems with correlated features. We show that for this general situation, a Group Lasso with a different choice of penalty matrix is generally more effective. We give insight into this formulation and show that it is intimately related to the uniformly most powerful invariant test for inclusion of a group. We demonstrate the efficacy of this method, the "standardized Group Lasso", over the usual group lasso on real and simulated data sets. We also extend this to the Ridged Group Lasso to provide within-group regularization as needed. We discuss a simple algorithm based on group-wise coordinate descent to fit both the standardized Group Lasso and the Ridged Group Lasso.
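
One way to see what the different penalty matrix does: penalizing ||X_g beta_g||_2 for each group is the same as running the usual group lasso after orthonormalizing the columns within each group. The R sketch below shows that reparameterization on simulated data; it is an illustration of the idea, not the paper's algorithm, and the group structure is invented.

```r
# Reparameterization behind the standardized Group Lasso: orthonormalize each
# group's columns (QR), run an ordinary group lasso on the transformed design,
# then map coefficients back through the R factors.
set.seed(1)
n <- 100
groups <- rep(1:5, each = 4)                       # 5 groups of 4 columns
x <- matrix(rnorm(n * 20), n, 20)
x[, groups == 1] <- x[, groups == 1] + rnorm(n)    # induce within-group correlation

x_std <- x
R_list <- vector("list", 5)
for (g in 1:5) {
  qr_g <- qr(x[, groups == g, drop = FALSE])
  x_std[, groups == g] <- qr.Q(qr_g)               # orthonormal within-group columns
  R_list[[g]] <- qr.R(qr_g)                        # keep R_g to back-transform
}
# An ordinary group lasso fitted to x_std now carries the standardized penalty,
# since ||Q_g theta_g||_2 = ||theta_g||_2 with X_g = Q_g R_g; recover the
# original coefficients by solving R_list[[g]] %*% beta_g = theta_g per group.
```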

130 citations


Posted Content
TL;DR: It is shown that, coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings and can be useful for practicing personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment.
Abstract: We consider a setting in which we have a treatment and a large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces valid inferences in a variety of settings. It can be useful for personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and gene expression studies of cancer. The modified data can be used for other purposes, for example large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable.
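
A hedged R sketch of the modified-covariates idea as described in the abstract: code the treatment as +/-1, multiply each covariate by T/2, and fit a model with no main effects. The lasso fit via glmnet and the simulation are illustrative choices, not the authors' code.

```r
# Modified covariates for treatment-covariate interactions (illustrative sketch).
library(glmnet)

set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
trt <- sample(c(-1, 1), n, replace = TRUE)          # randomized treatment, coded +/-1
y <- drop(x %*% rnorm(p, sd = 0.5)) +               # prognostic main effects
     trt * (x[, 1] - x[, 2]) + rnorm(n)             # true treatment interactions

w <- x * (trt / 2)                                  # modified covariates
fit <- cv.glmnet(w, y, intercept = FALSE)           # fit with no main effects
b <- as.vector(coef(fit, s = "lambda.min"))[-1]
which(b != 0)                                       # covariates flagged as interacting
```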

Journal ArticleDOI
TL;DR: In this article, the effects of row and column correlations on procedures for large-scale inference on the row or column variables of data in the form of a matrix are studied; a motivating example is the detection of significant genes in microarrays when the samples may be dependent because of latent variables or unknown batch effects.
Abstract: Summary. We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent because of latent variables or unknown batch effects. By modelling this matrix data by using the matrix variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems that are presented by unexpected correlations: we simultaneously estimate row and column covariances and use these to sphere or decorrelate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple-testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: increased statistical power, less bias in estimating the false discovery rate and reduced variance of the false discovery rate estimators.
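
A hedged numerical sketch of the sphering step described above: estimate row and column covariances and decorrelate the data before inference. The plug-in covariance estimates and ridge regularization here are naive illustrative choices; the paper estimates the two covariances jointly under the matrix variate normal model.

```r
# Sphering a transposable data matrix: X_sph = Sigma_r^{-1/2} X Sigma_c^{-1/2}
# (illustrative plug-in estimates, not the paper's estimator).
matpower <- function(S, pow) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(pmax(e$values, 1e-8)^pow) %*% t(e$vectors)
}

set.seed(1)
n_row <- 50; n_col <- 30
X <- matrix(rnorm(n_row * n_col), n_row, n_col)
Sig_r <- cov(t(X)) + diag(1e-3, n_row)     # covariance among rows (regularized)
Sig_c <- cov(X) + diag(1e-3, n_col)        # covariance among columns (regularized)
X_sph <- matpower(Sig_r, -0.5) %*% X %*% matpower(Sig_c, -0.5)
# Row- or column-wise test statistics computed on X_sph should follow their
# null distributions more closely than those computed on X.
```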

Journal ArticleDOI
TL;DR: A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients.
Abstract: Background: Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. Objective: This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Methods: Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project ...

Posted Content
TL;DR: A permutation-based method for testing marginal interactions with a binary response is proposed; on real genomic data it finds apparent signal and tells a believable story where logistic regression does not, and asymptotic consistency results are given under assumptions that are not too restrictive.
Abstract: To date, testing interactions in high dimensions has been a challenging task. Existing methods often have issues with sensitivity to modeling assumptions and heavily asymptotic nominal p-values. To help alleviate these issues, we propose a permutation-based method for testing marginal interactions with a binary response. Our method searches for pairwise correlations which differ between classes. In this manuscript, we compare our method on real and simulated data to the standard approach of running many pairwise logistic models. On simulated data our method finds more significant interactions at a lower false discovery rate (especially in the presence of main effects). On real genomic data, although there is no gold standard, our method finds apparent signal and tells a believable story, while logistic regression does not. We also give asymptotic consistency results under not too restrictive assumptions.
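
A hedged R sketch of the core statistic for a single feature pair (not the authors' full procedure, which handles many pairs and estimates an FDR): compare the pair's within-class correlations and calibrate the difference by permuting the class labels.

```r
# Permutation test for a marginal interaction between two features and a
# binary class label (illustrative sketch for one pair).
perm_test_pair <- function(x1, x2, cls, B = 1000) {
  stat <- function(lab) cor(x1[lab == 1], x2[lab == 1]) -
                        cor(x1[lab == 0], x2[lab == 0])
  obs  <- stat(cls)
  perm <- replicate(B, stat(sample(cls)))       # permute class labels
  mean(abs(perm) >= abs(obs))                   # two-sided permutation p-value
}

set.seed(1)
cls <- rep(c(0, 1), each = 100)
x1 <- rnorm(200)
x2 <- ifelse(cls == 1, 0.7, 0) * x1 + rnorm(200)  # correlated only in class 1
perm_test_pair(x1, x2, cls)
```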

Journal ArticleDOI
TL;DR: In this paper, the authors consider the testing of all pairwise interactions in a two-class problem with many features and devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect.
Abstract: We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. We show - both in simulation and on a genomic data set from the SAPPHIRe study - a potential gain in power and interpretability over a standard (nonhierarchical) interaction test.

Journal ArticleDOI
TL;DR: It is proved that when log-transformed signal is used as the input for signal reconstruction, it will always yield an underestimation of the true signal.
Abstract: gene-expression values [1]. The authors mixed complementary RNA from the tissues and observed similar off-diagonal effects. They concluded that the off-diagonal effects are due to technical reasons, such as nonlinear sample amplification or probe cross-hybridization, rather than statistical deconvolution. We found that this deviation of signal reconstruction was the result of data transformation. In microarray studies, expression data are logarithm-transformed for variance stabilization or for approximation of a normal distribution [2]. However, we argue that in the context of expression-profile deconvolution, the log transformation will produce biased estimation. Deconvolution is modeled by a linear equation O = S × W, where O is the expression data for mixed tissue samples, S is the tissue-specific expression profile, and W is the cell-type frequency matrix. If the signal is log-transformed, the linearity will no longer be preserved. The concavity of the log function will induce a downward bias in the reconstructed signal (Fig. 1a and Supplementary Fig. 1). Mathematically, it can be shown that the deconvolution model used on log-transformed signals is log(Ô) = log(S) × W, where Ô is the csSAM estimate of gene-expression profiles. As W is a frequency matrix and its column values sum to 1, the following is true by the properties of concave functions [3]: log(S × W) > log(S) × W. Taking these two equations together, we can conclude that log(Ô) < log(S × W) = log(O). Thus, we proved that when the log-transformed signal is used as the input for signal reconstruction, it will always yield an underestimation of the true signal. By taking an anti-log transformation, we obtained an unbiased reconstruction of the mixed tissue samples (Fig. 1b and Supplementary Fig. 2). The log transformation also introduced a large bias into the results of deconvolution (Fig. 1c and Supplementary Fig. 3). A substantial portion of the genes were off diagonal in the deconvolved cell type-specific gene-expression profiles. By performing the deconvolution in linear space, we achieved a considerably more accurate result (Fig. 1d and Supplementary Fig. 3). In summary, an incorrect transformation of data can greatly bias the final results of deconvolution. In the context of gene-expression deconvolution, a linear model achieves better accuracy. Accurate deconvolution of expression profiles is important for downstream analysis, such as gene-expression analysis and pathway-enrichment analysis. We urge caution in selecting data-transformation functions and any preprocessing steps in model-based statistical analysis.
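
The concavity argument in the letter is easy to verify numerically; the small R example below uses arbitrary values for S and a frequency matrix W whose columns sum to 1.

```r
# Jensen's inequality for the concave log: log(S %*% W) >= log(S) %*% W when
# the columns of W are nonnegative and sum to 1, so deconvolving log-scale
# data biases the reconstruction downward. Numbers are arbitrary.
S <- matrix(c(10, 200,
              300, 20), 2, 2, byrow = TRUE)   # tissue-specific expression (2 genes x 2 cell types)
W <- matrix(c(0.7, 0.4,
              0.3, 0.6), 2, 2, byrow = TRUE)  # cell-type frequencies, columns sum to 1
log(S %*% W)                                  # log of the true mixed signal
log(S) %*% W                                  # what a log-scale deconvolution model implies
all(log(S %*% W) >= log(S) %*% W)             # TRUE: log-scale reconstruction is biased low
```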

Posted Content
06 Nov 2012
TL;DR: A hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect is devised, based on a convex optimization framework that seamlessly considers main effects and interactions together.
Abstract: We consider the testing of all pairwise interactions in a two-class problem with many features. We devise a hierarchical testing framework that considers an interaction only when one or more of its constituent features has a nonzero main effect. The test is based on a convex optimization framework that seamlessly considers main effects and interactions together. We show - both in simulation and on a genomic data set from the SAPPHIRe study - a potential gain in power and interpretability over a standard (nonhierarchical) interaction test.