
Showing papers by "Robert Tibshirani" published in 2021


Book
29 Jul 2021
TL;DR: This book presents some of the most important modeling and prediction techniques, along with relevant applications, that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years.
Abstract: An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.
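In the spirit of the book's chapter-end R labs, here is a minimal sketch of one of the shrinkage approaches it covers: a cross-validated lasso fit with the glmnet package (data simulated purely for illustration).

```r
# Minimal sketch of a cross-validated lasso, in the style of the book's
# R labs. Assumes the glmnet package; the data are simulated.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)        # predictors
beta <- c(rep(2, 3), rep(0, p - 3))    # only 3 truly active features
y <- x %*% beta + rnorm(n)             # response

cvfit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1 selects the lasso
coef(cvfit, s = "lambda.min")          # sparse coefficient estimates
```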

3,439 citations


Journal ArticleDOI
TL;DR: In this article, the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals) was evaluated; the results delineate the genetic basis of biomarkers and their causal influences on diseases, and improve genetic risk stratification for common diseases.
Abstract: Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.
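A 'multi-PRS' model of the kind described can be thought of as a regression of disease status on many PRS columns at once. The sketch below is a hedged toy version with simulated data; the variable names (prs1–prs35, disease) are placeholders, not the paper's pipeline.

```r
# Toy sketch of a 'multi-PRS' model: logistic regression of one disease
# on 35 biomarker PRSs simultaneously. All data are simulated; this
# illustrates the idea, not the paper's actual pipeline.
set.seed(1)
n <- 5000; k <- 35
prs <- matrix(rnorm(n * k), n, k,
              dimnames = list(NULL, paste0("prs", 1:k)))
logit <- -2 + prs %*% rnorm(k, sd = 0.2)      # toy disease liability
disease <- rbinom(n, 1, plogis(logit))

dat <- data.frame(disease = disease, prs)
fit_multi <- glm(disease ~ ., family = binomial, data = dat)
head(coef(summary(fit_multi)))                # inspect a few PRS weights
```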

262 citations


Journal ArticleDOI
01 Jul 2021
TL;DR: A key role for the chemokine CXCL9 is identified in age-related chronic inflammation, cardiac aging, adverse cardiac remodeling and poor vascular function, and a metric for multimorbidity is derived that can be utilized for the early detection of age-related clinical phenotypes.
Abstract: While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8–96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in human and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes. From the blood immunome of 1,001 individuals aged 8–96 years, the authors used deep learning to develop an inflammatory clock of aging (iAge) that tracks with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The main contributor to iAge is the chemokine CXCL9, which is shown to control endothelial cell senescence and function.

155 citations


Posted ContentDOI
Estee Y Cramer1, Evan L. Ray1, Velma K. Lopez2, Johannes Bracher3 +281 more · Institutions (53)
05 Feb 2021-medRxiv
TL;DR: In this paper, the authors systematically evaluated 23 models that regularly submitted forecasts of reported weekly incident COVID-19 mortality counts in the US, at the state and national level, to the COVID-19 Forecast Hub.
Abstract: Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. In 2020, the COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized hundreds of thousands of specific predictions from more than 50 different academic, industry, and independent research groups. This manuscript systematically evaluates 23 models that regularly submitted forecasts of reported weekly incident COVID-19 mortality counts in the US at the state and national level. One of these models was a multi-model ensemble that combined all available forecasts each week. The performance of individual models showed high variability across time, geospatial units, and forecast horizons. Half of the models evaluated showed better accuracy than a naive baseline model. In combining the forecasts from all teams, the ensemble showed the best overall probabilistic accuracy of any model. Forecast accuracy degraded as models made predictions farther into the future, with probabilistic accuracy at a 20-week horizon more than 5 times worse than when predicting at a 1-week horizon. This project underscores the role that collaboration and active coordination between governmental public health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.
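The hub scores quantile-format forecasts with the weighted interval score (WIS); a core ingredient is the pinball (quantile) loss, sketched below. Averaging pinball loss over the submitted quantile levels is closely related to the WIS (this sketch assumes a forecast issued as a small set of quantiles).

```r
# Pinball (quantile) loss for scoring a probabilistic forecast that is
# submitted as quantiles; its average over quantile levels is closely
# related to the weighted interval score used by the Forecast Hub.
pinball <- function(y, q, tau) {
  # y: observed value; q: predicted tau-quantile; tau: quantile level
  ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y))
}

taus <- c(0.025, 0.25, 0.5, 0.75, 0.975)  # example quantile levels
qhat <- c(80, 95, 100, 108, 130)          # one forecast's quantiles
obs  <- 112                               # realized weekly count
mean(pinball(obs, qhat, taus))            # average pinball score
```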

68 citations


Posted ContentDOI
25 Jun 2021-medRxiv
TL;DR: In this paper, the utility of auxiliary COVID-19 indicators is studied from a forecasting perspective. The authors focus on five indicators, derived from medical insurance claims data, web search queries, and online survey responses, and ask whether each indicator's inclusion in a simple model leads to improved predictive accuracy relative to a similar model excluding it.
Abstract: Reliable, short-term forecasts of traditional public health reporting streams (such as cases, hospitalizations, and deaths) are a key ingredient in effective public health decision-making during a pandemic. Since April 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity. This paper studies the utility of these indicators from a forecasting perspective. We focus on five indicators, derived from medical insurance claims data, web search queries, and online survey responses. For each indicator, we ask whether its inclusion in a simple model leads to improved predictive accuracy relative to a similar model excluding it. We consider both probabilistic forecasting of confirmed COVID-19 case rates and binary prediction of case "hotspots". Since the values of indicators (and case rates) are commonly revised over time, we take special care to ensure that the data provided to a forecaster is the version that would have been available at the time the forecast was made. Our analysis shows that consistent but modest gains in predictive accuracy are obtained by using these indicators, and furthermore, these gains are related to periods in which the auxiliary indicators behave as "leading indicators" of case rates.
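The paper's point about revisions can be made concrete with a small "as of" filter: when backtesting, keep only the data versions that had been issued by the forecast date. The data frame and columns below (time_value, issue, value) are hypothetical stand-ins for a revision-tracked indicator.

```r
# Sketch of vintage-aware filtering: reconstruct what a forecaster
# would have seen on forecast_date from a revision-tracked data frame
# with columns time_value (date measured), issue (date published), and
# value. Column names are illustrative assumptions.
as_of <- function(df, forecast_date) {
  df <- df[df$issue <= forecast_date, ]     # drop not-yet-issued revisions
  df <- df[order(df$time_value, df$issue), ]
  df[!duplicated(df$time_value, fromLast = TRUE), ]  # latest issue per date
}
```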

23 citations


Journal ArticleDOI
TL;DR: SparseSignatures, presented in this paper, incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets.
Abstract: Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.

18 citations


Journal ArticleDOI
TL;DR: A penalized Cox proportional hazards model is applied for left-truncated and right-censored survival data and implications of left truncation adjustment on bias and interpretation are assessed.
Abstract: High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases survival data are also left-truncated which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database to highlight the pitfalls of failing to account for left truncation in survival modeling.
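In R, left-truncated, right-censored data are encoded as (entry, exit] intervals with survival::Surv, and recent versions of glmnet accept the same object for a penalized Cox fit. The sketch below uses simulated data and illustrates the adjustment, not the paper's exact analysis.

```r
# Sketch of a penalized Cox model adjusted for left truncation via
# (start, stop] coding. Assumes the survival package and a recent
# glmnet release that supports start-stop Cox responses.
library(survival)
library(glmnet)

set.seed(1)
n <- 200; p <- 50
x     <- matrix(rnorm(n * p), n, p)
entry <- runif(n)                      # delayed-entry (truncation) times
exit  <- entry + rexp(n, rate = 0.5)   # observed exit times
event <- rbinom(n, 1, 0.7)             # 1 = event, 0 = right-censored

y   <- Surv(entry, exit, event)        # left truncation enters here
fit <- glmnet(x, y, family = "cox")    # penalized, truncation-adjusted
```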

17 citations


Journal ArticleDOI
01 Mar 2021-Allergy
TL;DR: Asthma prolongs intubation in COVID-19, race is associated with differences in airway inflammation in patients with asthma, and eosinophil responses occur during COVID-19 infections and coronavirus vaccination.
Abstract: REFERENCES
1. Carli G, Cecchi L, Stebbing J, Parronchi P, Farsi A. Is asthma protective against COVID-19? Allergy. 2021;76:866–936.
2. Mahdavinia M, Foster KJ, Jauregui E, et al. Asthma prolongs intubation in COVID-19. J Allergy Clin Immunol Pract. 2020;8(7):2388-2391.
3. Hooper MW, Nápoles AM, Pérez-Stable EJ. COVID-19 and racial/ethnic disparities. JAMA. 2020;323(24):2466-2467.
4. Noonan AS, Velasco-Mondragon HE, Wagner FA. Improving the health of African Americans in the USA: an overdue opportunity for social justice. Public Health Rev. 2016;37(1):1-20.
5. Nyenhuis SM, Krishnan JA, Berry A, et al. Race is associated with differences in airway inflammation in patients with asthma. J Allergy Clin Immunol. 2017;140(1):257-265.e211.
6. Lindsley AW, Schwartz JT, Rothenberg ME. Eosinophil responses during COVID-19 infections and coronavirus vaccination. J Allergy Clin Immunol. 2020;146(1):1-7.

15 citations


Posted Content
TL;DR: This paper showed that the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level because each data point is used for both training and testing, and so the usual estimate of variance is too small.
Abstract: Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. Lastly, our analysis also shows that when producing confidence intervals for prediction accuracy with simple data splitting, one should not re-fit the model on the combined data, since this invalidates the confidence intervals.
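The naive interval whose undercoverage the paper documents is easy to write down: treat the fold-level errors as independent and use their standard error. A hedged sketch with simulated data:

```r
# The naive CV confidence interval critiqued above: fold errors are
# treated as independent even though data reuse across folds makes them
# correlated, so this variance estimate tends to be too small.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
y <- x[, 1] + rnorm(n)
folds <- sample(rep(1:10, length.out = n))

errs <- sapply(1:10, function(k) {
  fit  <- lm(y[folds != k] ~ x[folds != k, ])
  pred <- cbind(1, x[folds == k, ]) %*% coef(fit)
  mean((y[folds == k] - pred)^2)       # held-out MSE for fold k
})
est <- mean(errs)
se  <- sd(errs) / sqrt(10)             # naive SE: independence assumed
c(est - 2 * se, est + 2 * se)          # interval that can undercover
```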

13 citations


Journal ArticleDOI
TL;DR: In this paper, a latent factor model of genetic risk based on components from the Decomposition of Genetic Associations (DeGAs) is introduced, yielding the DeGAs polygenic risk score (dPRS).
Abstract: Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual's genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few "outlier" individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.
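The decomposition can be pictured as a truncated SVD of a variants-by-traits matrix of association statistics, with an individual's genetic risk re-expressed through the components. The sketch below uses random placeholder matrices, not real summary statistics, and compresses the method to its linear-algebra core.

```r
# Hedged sketch of the DeGAs/dPRS idea: truncated SVD of a
# (variants x traits) effect matrix; an individual's PRS for one trait
# is then decomposed into per-component contributions. All matrices
# here are random placeholders.
set.seed(1)
B <- matrix(rnorm(1000 * 20), 1000, 20)     # variants x traits effects
s <- svd(B, nu = 5, nv = 5)                 # keep 5 latent components

G <- matrix(rbinom(3 * 1000, 2, 0.3), 3)    # 3 people x 1000 genotypes
comp_scores <- G %*% s$u                    # per-component genetic scores

trait <- 1
contrib <- sweep(comp_scores, 2, s$d[1:5] * s$v[trait, ], `*`)
rowSums(contrib)   # rank-5 PRS for trait 1; columns decompose it
```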

13 citations


Journal ArticleDOI
TL;DR: In this article, the MasSpec Pen technology integrated to electrospray ionization (ESI) was used for direct analysis of clinical swabs and investigate its use for COVID-19 screening.
Abstract: The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI mass spectrometry (MS) allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.

Journal Article
TL;DR: LassoNet, as discussed in this paper, enforces a hierarchy: a feature can participate in a hidden unit only if its linear representative is active. It integrates feature selection directly with parameter learning and, as a result, delivers an entire regularization path of solutions with a range of feature sparsity.
Abstract: Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. On systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
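The hierarchy can be illustrated with a toy projection: feature j's first-layer weights are bounded by a multiple of its skip-connection weight, so a feature with an inactive linear term is excluded from the hidden units. This is a simplified stand-in for the paper's hierarchical proximal step, not the exact operator.

```r
# Toy illustration of the LassoNet hierarchy constraint
# |W[, j]| <= M * |theta[j]|: if the linear (skip) weight theta[j] is
# zero, feature j's first-layer weights are forced to zero too.
# Simplified stand-in for the paper's proximal operator.
enforce_hierarchy <- function(W, theta, M = 10) {
  bound <- M * abs(theta)                          # per-feature budget
  sign(W) * pmin(abs(W), rep(bound, each = nrow(W)))
}

W <- matrix(rnorm(8), nrow = 2)   # 2 hidden units x 4 features
theta <- c(1, 0, 0.5, 0)          # skip weights; features 2, 4 inactive
enforce_hierarchy(W, theta)       # columns 2 and 4 become exactly zero
```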

Posted ContentDOI
06 Sep 2021-medRxiv
TL;DR: In this article, a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank is presented.
Abstract: We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank. We report 428 sparse PRS models with significant (p < 2.5 x 10^-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, and the genotype principal components. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance in quantitative traits (Spearman's ρ = 0.54, p = 1.4 x 10^-15), but not in binary traits (ρ = 0.059, p = 0.35). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).

Journal ArticleDOI
TL;DR: The Stanford Prostate Cancer Calculator (SPCC) as mentioned in this paper combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds.
Abstract: Background While multiparametric MRI (mpMRI) has high sensitivity for detection of clinically significant prostate cancer (CSC), false positives and negatives remain common. Calculators that combine mpMRI with clinical variables can improve cancer risk assessment, while providing more accurate predictions for individual patients. We sought to create and externally validate nomograms incorporating Prostate Imaging Reporting and Data System (PIRADS) scores and clinical data to predict the presence of CSC in men of all biopsy backgrounds. Methods Data from 2125 men undergoing mpMRI and MR fusion biopsy from 2014 to 2018 at Stanford, Yale, and UAB were prospectively collected. Clinical data included age, race, PSA, biopsy status, PIRADS scores, and prostate volume. A nomogram predicting detection of CSC on targeted or systematic biopsy was created. Results Biopsy history, Prostate Specific Antigen (PSA) density, PIRADS score of 4 or 5, Caucasian race, and age were significant independent predictors. Our nomogram—the Stanford Prostate Cancer Calculator (SPCC)—combined these factors in a logistic regression to provide stronger predictive accuracy than PSA density or PIRADS alone. Validation of the SPCC using data from Yale and UAB yielded robust AUC values. Conclusions The SPCC combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds. The SPCC demonstrates strong external generalizability with successful validation in two separate institutions. The calculator is available as a free web-based tool that can direct real-time clinical decision-making.
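A nomogram of this kind is, at its core, a logistic regression on the listed predictors. The sketch below simulates a toy cohort with hypothetical variable names; it illustrates the model form, not the SPCC's fitted coefficients.

```r
# Hedged sketch of a nomogram-style risk model: logistic regression of
# clinically significant cancer (csc) on SPCC-like predictors. The
# cohort and variable names are simulated placeholders.
set.seed(1)
n <- 500
cohort <- data.frame(
  age          = rnorm(n, 65, 7),
  race         = factor(sample(c("Caucasian", "Other"), n, TRUE)),
  psa_density  = rexp(n, 10),
  pirads_45    = rbinom(n, 1, 0.4),   # PIRADS score of 4 or 5
  prior_biopsy = rbinom(n, 1, 0.5)
)
cohort$csc <- rbinom(n, 1, plogis(-3 + 5 * cohort$psa_density +
                                  1.2 * cohort$pirads_45))

fit <- glm(csc ~ age + race + psa_density + pirads_45 + prior_biopsy,
           family = binomial, data = cohort)
predict(fit, newdata = cohort[1, ], type = "response")  # per-patient risk
```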

Journal ArticleDOI
TL;DR: In this article, Ravi et al. developed two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals.
Abstract: Motivation: Large-scale and high-dimensional genome sequencing data pose computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. Results: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of the compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. Availability: https://github.com/rivas-lab/snpnet/tree/compact.
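The 2-bit idea can be demonstrated in a few lines: map {0, 1, 2, NA} to four 2-bit codes and pack four genotypes per byte. The packing layout below is illustrative, not snpnet-2.0's exact memory format.

```r
# Sketch of 2-bit genotype packing: codes {0, 1, 2, NA->3} occupy two
# bits each, so four genotypes fit in one byte -- a 32x saving over
# 8-byte doubles. Layout is illustrative, not snpnet-2.0's exact format.
pack_genotypes <- function(g) {
  code <- ifelse(is.na(g), 3L, as.integer(g))     # NA -> code 3
  length(code) <- 4 * ceiling(length(code) / 4)   # pad to multiple of 4
  code[is.na(code)] <- 0L                         # zero the padding
  m <- matrix(code, nrow = 4)                     # 4 genotypes per byte
  as.raw(m[1, ] + bitwShiftL(m[2, ], 2) +
         bitwShiftL(m[3, ], 4) + bitwShiftL(m[4, ], 6))
}

pack_genotypes(c(0, 1, 2, NA, 2, 2))   # 6 genotypes -> 2 bytes
```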


Posted ContentDOI
16 Feb 2021-bioRxiv
TL;DR: In this paper, the authors developed two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals.
Abstract: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of the compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve group Lasso problems on sparse genetic matrices with more than 1,000,000 columns and almost 100,000 rows within 10 minutes and using less than 32 GB of memory.


Posted ContentDOI
21 Jun 2021-medRxiv
TL;DR: This paper describes the design and implementation of an on-demand consultation service that derives evidence from patient data to answer clinician questions and support bedside decision making; the tools and methods developed are made publicly available to facilitate broad adoption of such services by health systems and academic medical centers.
Abstract: Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implementation of this vision is now possible. Motivated by these advances, and the information needs of clinicians in our academic medical center, we offered an on-demand consultation service to derive evidence from patient data to answer clinician questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. Consultation results informed individual patient care, resulted in changes to institutional practices, and motivated further clinical research. We make the tools and methods developed to implement the service publicly available to facilitate the broad adoption of such services by health systems and academic medical centers.

Journal ArticleDOI
TL;DR: MassExplorer, presented in this paper, is a tool to pre-process DESI-MSI data, visualize raw data, build predictive models using the statistical lasso approach to select for a sparse set of significant molecular changes, and interpret selected metabolites.
Abstract: Summary: In the last few years, desorption electrospray ionization mass spectrometry imaging (DESI-MSI) has been increasingly used for simultaneous detection of thousands of metabolites and lipids from human tissues and biofluids. Successfully finding the most significant differences between two sets of DESI-MSI data (e.g., healthy vs disease) requires the application of accurate computational and statistical methods that can pre-process the data under various normalization settings and help identify these changes among thousands of detected metabolites. Here, we report MassExplorer, a novel computational tool, to help pre-process DESI-MSI data, visualize raw data, build predictive models using the statistical lasso approach to select for a sparse set of significant molecular changes, and interpret selected metabolites. This tool, which is available for both online and offline use, is flexible for chemists, biologists, and statisticians, as it helps in visualizing the structure of DESI-MSI data and in analyzing the statistically significant metabolites that are differentially expressed across both sample types. Based on the modules in MassExplorer, we expect it to be immediately useful for various biological and chemical applications in mass spectrometry. Availability and implementation: MassExplorer is available as an online R-Shiny application or Mac OS X compatible standalone application. The application, sample performance, source code and corresponding guide can be found at: https://zarelab.com/research/massexplorer-a-tool-to-help-guide-analysis-of-mass-spectrometry-samples/. Supplementary information: Supplementary data are available at Bioinformatics online.


Posted ContentDOI
24 Jul 2021-bioRxiv
TL;DR: In this article, a multiscale, integrated profiling of Ductal Carcinoma in situ (DCIS) with clinical outcomes was performed by analyzing 677 DCIS samples from 481 patients with 7.1 years median follow-up from the Translational Breast Cancer Research Consortium (TBCRC) 038 study and the Resource of Archival Breast Tissue (RAHBT) cohorts.
Abstract: SUMMARY: Ductal carcinoma in situ (DCIS) is the most common precursor of invasive breast cancer (IBC), with variable propensity for progression. We have performed the first multiscale, integrated profiling of DCIS with clinical outcomes by analyzing 677 DCIS samples from 481 patients with 7.1 years median follow-up from the Translational Breast Cancer Research Consortium (TBCRC) 038 study and the Resource of Archival Breast Tissue (RAHBT) cohorts. We made observations on DNA, RNA, and protein expression, and generated a de novo clustering scheme for DCIS that represents a fundamental transcriptomic organization at this early stage of breast neoplasia. Distinct stromal expression patterns and immune cell compositions were identified. We found RNA expression patterns that correlate with later events. Our multiscale approach employed in situ methods to generate a spatially resolved atlas of breast precancers, where complementary modalities can be directly compared and correlated with conventional pathology findings, disease states, and clinical outcome. HIGHLIGHTS: A new transcriptomic classification solution reveals 3 major subgroups in DCIS. Four stroma-specific signatures are identified. Outcome analysis identifies pathways involved in DCIS progression. CNAs characterize high-risk-of-distant-relapse IBC subtypes observed in DCIS.

Posted ContentDOI
16 Jul 2021-medRxiv
TL;DR: The COVIDcast API as mentioned in this paper provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from de-identified medical claims data, massive online surveys, cell phone mobility data, and internet search trends.
Abstract: The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from de-identified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data is available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
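For readers who want to pull these signals, the accompanying R package exposes the API directly; the call below reflects the package's documented interface at the time of writing (signal names and arguments should be treated as assumptions if the API has since evolved).

```r
# Sketch of fetching a revision-aware signal with the covidcast R
# package. The as_of argument reconstructs the data exactly as it was
# known on a given date, matching the API's revision tracking.
library(covidcast)

cases <- covidcast_signal(
  data_source = "jhu-csse",
  signal      = "confirmed_7dav_incidence_prop",  # smoothed case rates
  start_day   = "2021-01-01", end_day = "2021-03-01",
  geo_type    = "county",
  as_of       = "2021-03-07"
)
```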

Journal ArticleDOI
TL;DR: In this paper, a sparse-group regularized Cox regression method was proposed to improve the prediction performance of large-scale and high-dimensional survival data with few observed events; the approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response.
Abstract: MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only few uncensored events in the training data. RESULTS: We propose a sparse-group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset, where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this article, a matching distance derived from proximity scores in random forests is proposed to assess the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard prediction errors is not applicable.
Abstract: We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
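A hedged sketch of the first ingredient, the proximity-based matching distance, with greedy nearest-neighbor matching standing in for the paper's minimum-cost-flow formulation:

```r
# Sketch: random-forest proximity as a matching distance for HTE
# assessment. Greedy nearest-neighbor matching below is a simplified
# stand-in for the paper's average minimum-cost flow algorithm.
library(randomForest)

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), n, 4)
w <- rbinom(n, 1, 0.5)                      # treatment indicator
y <- x[, 1] + w * (1 + x[, 2]) + rnorm(n)   # heterogeneous effect

rf <- randomForest(x, y, proximity = TRUE)
d  <- 1 - rf$proximity                      # matching distance
treated <- which(w == 1); control <- which(w == 0)
match_id   <- control[apply(d[treated, control], 1, which.min)]
pseudo_hte <- y[treated] - y[match_id]      # pseudo-observations of HTE
```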

Posted ContentDOI
19 May 2021-medRxiv
TL;DR: In this article, the MasSpec Pen technology integrated to electrospray ionization (ESI) was used for direct analysis of clinical swabs and investigate its use for COVID-19 screening.
Abstract: The outbreak of COVID-19 has created an unprecedented global crisis. While PCR is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and a validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI MS allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.

Journal ArticleDOI
TL;DR: This publisher correction notes that the article "The Bootstrap Method for Assessing Statistical Accuracy", by Bradley Efron and Robert Tibshirani, was originally published Online First without Open Access.
Abstract: The article The Bootstrap Method for Assessing Statistical Accuracy, written by Bradley Efron and Robert Tibshirani, was originally published Online First without Open Access.


Posted ContentDOI
12 Feb 2021-medRxiv
TL;DR: In this paper, a penalized Cox proportional hazards model for left-truncated and right-censored survival data was applied to assess the implications of left truncation adjustment on bias and interpretation.
Abstract: High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records (EHRs), and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases survival data is also left-truncated which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database (CGDB) to highlight the pitfalls of failing to account for left truncation in survival modeling.

Posted Content
TL;DR: In this paper, a method for casting survival analysis problems as classification problems is presented, thereby allowing the use of general classification methods and software in a survival setting and lowering the barrier to flexible modeling of right-censored data.
Abstract: While there are many well-developed data science methods for classification and regression, there are relatively few methods for working with right-censored data. Here, we present "survival stacking": a method for casting survival analysis problems as classification problems, thereby allowing the use of general classification methods and software in a survival setting. Inspired by the Cox partial likelihood, survival stacking collects features and outcomes of survival data in a large data frame with a binary outcome. We show that survival stacking with logistic regression is approximately equivalent to the Cox proportional hazards model. We further recommend methods for evaluating model performance in the survival stacked setting, and we illustrate survival stacking on real and simulated data. By reframing survival problems as classification problems, we make it possible for data scientists to use well-known learning algorithms (including random forests, gradient boosting machines and neural networks) in a survival setting, and lower the barrier for flexible survival modeling.
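The stacking construction itself is short: at each event time, emit one row per at-risk subject with a binary indicator of failure at that time; logistic regression on the stacked frame then approximates the Cox model. A minimal sketch with simulated data (ignoring ties and efficiency concerns):

```r
# Minimal survival-stacking sketch: build the risk-set-expanded binary
# data frame, then fit logistic regression. Coefficients on x1 and x2
# approximate the Cox log hazard ratios.
stack_survival <- function(time, status, x) {
  do.call(rbind, lapply(which(status == 1), function(i) {
    at_risk <- which(time >= time[i])        # risk set at the i-th event
    data.frame(x[at_risk, , drop = FALSE],
               risk_time = time[i],
               failed    = as.integer(at_risk == i))
  }))
}

set.seed(1)
x <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
time <- rexp(50); status <- rbinom(50, 1, 0.7)

stacked <- stack_survival(time, status, x)
fit <- glm(failed ~ x1 + x2 + factor(risk_time),  # risk_time dummies play the role of the baseline hazard
           family = binomial, data = stacked)
coef(fit)[c("x1", "x2")]
```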