
Showing papers by "Robert Tibshirani published in 2020"


25 Apr 2020
TL;DR: In this paper, the authors describe the important ideas in these areas in a common conceptual framework, and the emphasis is on concepts rather than mathematics, with a liberal use of color graphics.
Abstract: During the past decade there has been an explosion in computation and information technology. With it have come vast amounts of data in a variety of fields such as medicine, biology, finance, and marketing. The challenge of understanding these data has led to the development of new tools in the field of statistics, and spawned new areas such as data mining, machine learning, and bioinformatics. Many of these tools have common underpinnings but are often expressed with different terminology. This book describes the important ideas in these areas in a common conceptual framework. While the approach is statistical, the emphasis is on concepts rather than mathematics. Many examples are given, with a liberal use of color graphics. It is a valuable resource for statisticians and anyone interested in data mining in science or industry. The book's coverage is broad, from supervised learning (prediction) to unsupervised learning. The many topics include neural networks, support vector machines, classification trees and boosting---the first comprehensive treatment of this topic in any book. This major new edition features many topics not covered in the original, including graphical models, random forests, ensemble methods, least angle regression and path algorithms for the lasso, non-negative matrix factorization, and spectral clustering. There is also a chapter on methods for "wide" data (p bigger than n), including multiple testing and false discovery rates. Trevor Hastie, Robert Tibshirani, and Jerome Friedman are professors of statistics at Stanford University. They are prominent researchers in this area: Hastie and Tibshirani developed generalized additive models and wrote a popular book of that title. Hastie co-developed much of the statistical modeling software and environment in R/S-PLUS and invented principal curves and surfaces. Tibshirani proposed the lasso and is co-author of the very successful An Introduction to the Bootstrap.
Friedman is the co-inventor of many data-mining tools including CART, MARS, projection pursuit and gradient boosting.

730 citations


Journal ArticleDOI
TL;DR: Outpatient and asymptomatic individuals’ SARS-CoV-2 antibodies, including IgG, progressively decreased during observation up to five months post-infection, but antibody responses in acute illness were insufficient to predict inpatient outcomes.
Abstract: SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, can neutralize the virus. It is, however, unknown which features of the serological response may affect clinical outcomes of COVID-19 patients. We analyzed 983 longitudinal plasma samples from 79 hospitalized COVID-19 patients and 175 SARS-CoV-2-infected outpatients and asymptomatic individuals. Within this cohort, 25 patients died of their illness. Higher ratios of IgG antibodies targeting S1 or RBD domains of spike compared to nucleocapsid antigen were seen in outpatients who had mild illness versus severely ill patients. Plasma antibody increases correlated with decreases in viral RNAemia, but antibody responses in acute illness were insufficient to predict inpatient outcomes. Pseudovirus neutralization assays and a scalable ELISA measuring antibodies blocking RBD-ACE2 interaction were well correlated with patient IgG titers to RBD. Outpatient and asymptomatic individuals' SARS-CoV-2 antibodies, including IgG, progressively decreased during observation up to five months post-infection.

369 citations


Journal ArticleDOI
09 Apr 2020-Nature
TL;DR: It is shown that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic, and a machine-learning method termed ‘lung cancer likelihood in plasma’ (Lung-CLiP) is developed, which can robustly discriminate early-stage lung cancer patients from risk-matched controls.
Abstract: Radiologic screening of high-risk adults reduces lung-cancer-related mortality1,2; however, a small minority of eligible individuals undergo such screening in the United States3,4. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq)5, a method for the analysis of circulating tumour DNA (ctDNA), to better facilitate screening applications. We show that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic. We also find that the majority of somatic mutations in the cell-free DNA (cfDNA) of patients with lung cancer and of risk-matched controls reflect clonal haematopoiesis and are non-recurrent. Compared with tumour-derived mutations, clonal haematopoiesis mutations occur on longer cfDNA fragments and lack mutational signatures that are associated with tobacco smoking. Integrating these findings with other molecular features, we develop and prospectively validate a machine-learning method termed 'lung cancer likelihood in plasma' (Lung-CLiP), which can robustly discriminate early-stage lung cancer patients from risk-matched controls. This approach achieves performance similar to that of tumour-informed ctDNA detection and enables tuning of assay specificity in order to facilitate distinct clinical applications. Our findings establish the potential of cfDNA for lung cancer screening and highlight the importance of risk-matching cases and controls in cfDNA-based screening studies.

320 citations


Journal ArticleDOI
14 Oct 2020-Nature
TL;DR: The Massive Analysis Quality Control (MAQC) Society Board of Directors*, Levi Waldron, Bo Wang, Chris McIntosh, Anna Goldenberg, Anshul Kundaje, Casey S. Greene, Tamara Broderick, Michael M. Hoffman, Jeffrey T. Leek, Keegan Korthauer, Wolfgang Huber, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. Ioannidis, John Quackenbush & Hugo J. W. Aerts
Abstract: Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, Massive Analysis Quality Control (MAQC) Society Board of Directors*, Levi Waldron, Bo Wang, Chris McIntosh, Anna Goldenberg, Anshul Kundaje, Casey S. Greene, Tamara Broderick, Michael M. Hoffman, Jeffrey T. Leek, Keegan Korthauer, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbush & Hugo J. W. L. Aerts

179 citations


Journal ArticleDOI
TL;DR: In this article, the authors identify obstacles hindering transparent and reproducible AI research, as faced by McKinney et al., and provide solutions with implications for the broader field, including cancer screening.
Abstract: In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al. and provide solutions with implications for the broader field.

166 citations


Journal ArticleDOI
25 Jun 2020-Cell
TL;DR: The Molecular Transducers of Physical Activity Consortium (MoTrPAC) will provide a public database that is expected to enhance the understanding of the health benefits of exercise and to provide insight into how physical activity mitigates disease.

163 citations


Journal ArticleDOI
25 Jun 2020-Cell
TL;DR: This study represents a weekly characterization of the human pregnancy metabolome, providing a high-resolution landscape for understanding pregnancy with potential clinical utilities.

113 citations


Journal ArticleDOI
TL;DR: A computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size is proposed.
Abstract: The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the l1-penalized linear model, logistic regression, the Cox model, and also extends to the elastic net with l1/l2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
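The screen-solve-check cycle at the heart of BASIL can be illustrated in a few lines. The sketch below is purely hypothetical (toy sizes, one fixed penalty, scikit-learn's lasso rather than glmnet/snpnet): it screens features by marginal correlation, solves the lasso on the screened batch only, and then verifies the lasso optimality (KKT) condition on the excluded features, which is what allows a batch solution to be certified as a solution to the full problem.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 2000                      # far more features than the batch size
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0    # only the first 5 features matter
y = X @ beta + rng.standard_normal(n)

lam = 0.2                             # fixed regularization level, for illustration
scores = np.abs(X.T @ (y - y.mean())) / n
batch = np.argsort(scores)[-200:]     # screen to the 200 most correlated features

fit = Lasso(alpha=lam).fit(X[:, batch], y)   # solve the lasso on the batch only
resid = y - fit.predict(X[:, batch])
outside = np.setdiff1d(np.arange(p), batch)
# KKT check: excluded features must satisfy |X_j' r| / n <= lambda
kkt_ok = np.all(np.abs(X[:, outside].T @ resid) / n <= lam + 1e-6)
print(kkt_ok)   # if True, the batch fit also solves the full-problem lasso
```

In the real algorithm this cycle repeats along a whole path of regularization values, with any violating features added to the next batch, so the full design never needs to be in memory at once.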

81 citations


Journal ArticleDOI
TL;DR: The gastrointestinal tract is defined as a reservoir of IgE+ B lineage cells in food allergy: high-throughput DNA analysis indicated that these tissues were the sites of local isotype switching, and similar antibody sequences for the Ara h 2 peanut allergen were shared between patients.
Abstract: B cells in human food allergy have been studied predominantly in the blood. Little is known about IgE+ B cells or plasma cells in tissues exposed to dietary antigens. We characterized IgE+ clones in blood, stomach, duodenum, and esophagus of 19 peanut-allergic patients, using high-throughput DNA sequencing. IgE+ cells in allergic patients are enriched in stomach and duodenum, and have a plasma cell phenotype. Clonally related IgE+ and non-IgE-expressing cell frequencies in tissues suggest local isotype switching, including transitions between IgA and IgE isotypes. Highly similar antibody sequences specific for peanut allergen Ara h 2 are shared between patients, indicating that common immunoglobulin genetic rearrangements may contribute to pathogenesis. These data define the gastrointestinal tract as a reservoir of IgE+ B lineage cells in food allergy.

80 citations


Journal ArticleDOI
TL;DR: This work introduces a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge.
Abstract: The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.

40 citations


Journal ArticleDOI
TL;DR: In this paper, Bertsimas, King and Mazumder showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem, which can now be solved at much larger problem sizes than what was thought possible in the statistics community.
Abstract: In exciting recent work, Bertsimas, King and Mazumder (Ann. Statist. 44 (2016) 813–852) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community. They presented empirical comparisons of best subset with other popular variable selection procedures, in particular, the lasso and forward stepwise selection. Surprisingly (to us), their simulations suggested that best subset consistently outperformed both methods in terms of prediction accuracy. Here, we present an expanded set of simulations to shed more light on these comparisons. The summary is roughly as follows:
• neither best subset nor the lasso uniformly dominates the other, with best subset generally performing better in very high signal-to-noise ratio (SNR) regimes, and the lasso better in low SNR regimes;
• for a large proportion of the settings considered, best subset and forward stepwise perform similarly, but in certain cases in the high SNR regime, best subset performs better;
• forward stepwise and best subset tend to yield sparser models (when tuned on a validation set), especially in the high SNR regime;
• the relaxed lasso (actually, a simplified version of the original relaxed estimator defined in Meinshausen (Comput. Statist. Data Anal. 52 (2007) 374–393)) is the overall winner, performing just about as well as the lasso in low SNR scenarios, and nearly as well as best subset in high SNR scenarios.
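The simplified relaxed lasso referred to above is easy to sketch: use the lasso only to select a support, then debias by refitting unpenalized least squares on the selected variables. The toy below (made-up data; the paper's estimator actually blends the lasso and least-squares fits, and this corresponds to the fully relaxed end of that blend) shows the two-step structure.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = 1.5               # a high-SNR setting
y = X @ beta + 0.5 * rng.standard_normal(n)

lasso = LassoCV(cv=5).fit(X, y)                  # step 1: lasso selects a support
support = np.flatnonzero(lasso.coef_)
relaxed = LinearRegression().fit(X[:, support], y)  # step 2: unpenalized refit
print(sorted(support.tolist()))
```

The refit removes the lasso's shrinkage bias on the selected coefficients, which is exactly what helps in the high-SNR regimes discussed above.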

Journal ArticleDOI
TL;DR: The use of desorption electrospray ionization mass spectrometry imaging (DESI‐MSI) is demonstrated as a molecular diagnostic and prognostic tool for clear cell renal cell carcinoma (ccRCC) and could be used for rapid intraoperative assessment of surgical margin status.
Abstract: Clear cell renal cell carcinoma (ccRCC) is the most common and lethal subtype of kidney cancer. Intraoperative frozen section (IFS) analysis is used to confirm the diagnosis during partial nephrectomy. However, surgical margin evaluation using IFS analysis is time consuming and unreliable, leading to relatively low utilization. In our study, we demonstrated the use of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) as a molecular diagnostic and prognostic tool for ccRCC. DESI-MSI was conducted on fresh-frozen paired normal and tumor nephrectomy specimens from 23 patients with ccRCC. An independent validation cohort of 17 normal-tumor pairs was analyzed. DESI-MSI provides two-dimensional molecular images of tissues with mass spectra representing small metabolites, fatty acids and lipids. These tissues were subjected to histopathologic evaluation. A set of metabolites that distinguish ccRCC from normal kidney were identified by performing least absolute shrinkage and selection operator (Lasso) and log-ratio Lasso analysis. Lasso analysis with leave-one-patient-out cross-validation selected 57 peaks from over 27,000 metabolic features across 37,608 pixels obtained using DESI-MSI of ccRCC and normal tissues. Baseline Lasso of metabolites predicted the class of each tissue to be normal or cancerous with accuracies of 94% and 76%, respectively. Combining the baseline Lasso with the ratio of glucose to arachidonic acid could potentially reduce scan time and improve accuracy to identify normal (82%) and ccRCC (88%) tissue. DESI-MSI allows rapid detection of metabolites associated with normal and ccRCC with high accuracy. As this technology advances, it could be used for rapid intraoperative assessment of surgical margin status.

Posted ContentDOI
17 Aug 2020-medRxiv
TL;DR: Outpatient and asymptomatic individuals' serological responses to SARS-CoV-2 decreased within 2 months, suggesting that humoral protection may be short-lived.
Abstract: SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, could offer protective immunity, and may affect clinical outcomes of COVID-19 patients. We analyzed 625 serial plasma samples from 40 hospitalized COVID-19 patients and 170 SARS-CoV-2-infected outpatients and asymptomatic individuals. Severely ill patients developed significantly higher SARS-CoV-2-specific antibody responses than outpatients and asymptomatic individuals. The development of plasma antibodies was correlated with decreases in viral RNAemia, consistent with potential humoral immune clearance of virus. Using a novel competition ELISA, we detected antibodies blocking RBD-ACE2 interactions in 68% of inpatients and 40% of outpatients tested. Cross-reactive antibodies recognizing SARS-CoV RBD were found almost exclusively in hospitalized patients. Outpatient and asymptomatic individuals’ serological responses to SARS-CoV-2 decreased within 2 months, suggesting that humoral protection may be short-lived.

Journal ArticleDOI
TL;DR: A generalization of the lasso that allows the model coefficients to vary as a function of a general set of some prespecified modifying variables, which might be variables such as gender, age, or time is proposed.
Abstract: We propose a generalization of the lasso that allows the model coefficients to vary as a function of a general set of some prespecified modifying variables. These modifiers might be variables such as gender, age, or time.
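A varying-coefficient structure of this kind can be mimicked crudely with interaction terms; the sketch below (illustrative data, a plain lasso rather than the paper's hierarchical penalty) fits an effect of a feature x that varies linearly with a modifier z, so the fitted effect of x is b_x + theta * z.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 300
x = rng.standard_normal(n)
z = rng.standard_normal(n)                 # modifying variable, e.g. age
y = (1.0 + 0.5 * z) * x + 0.1 * rng.standard_normal(n)  # effect of x varies with z

X = np.column_stack([x, z, x * z])         # main effects plus interaction
fit = Lasso(alpha=0.01).fit(X, y)
b_x, b_z, theta = fit.coef_
print(b_x, theta)                          # roughly recovers 1.0 and 0.5
```

The actual method penalizes the modifier effects hierarchically so that an interaction enters only when warranted; the interaction trick above has no such structure and is only meant to show what "coefficients varying with modifiers" means.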

Journal ArticleDOI
TL;DR: A scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the Lasso partial likelihood function, based on the Batch Screening Iterative Lasso method developed in Qian and others (2019).
Abstract: We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L1-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in the memory. The output of our algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
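The core subproblem here, a lasso-penalized Cox model, can be sketched with proximal gradient descent on the Breslow partial likelihood. The numpy toy below is not snpnet-Cox (no path, no screening, an illustrative fixed step size), but it shows the structure of the computation: a gradient step on the negative log partial likelihood followed by soft-thresholding.

```python
import numpy as np

def lasso_cox(X, time, event, lam, step=0.1, iters=500):
    """Proximal-gradient sketch of L1-penalized Cox regression (Breslow)."""
    n, p = X.shape
    order = np.argsort(-time)                # sort by decreasing survival time
    X, event = X[order], event[order]
    beta = np.zeros(p)
    for _ in range(iters):
        eta = X @ beta
        w = np.exp(eta - eta.max())          # stabilized relative risks
        cum_w = np.cumsum(w)                 # risk-set sums: R_i = {j : T_j >= T_i}
        cum_xw = np.cumsum(X * w[:, None], axis=0)
        ev = event == 1
        # gradient of the negative log partial likelihood, averaged over n
        grad = -(X[ev] - cum_xw[ev] / cum_w[ev, None]).sum(0) / n
        beta = beta - step * grad
        beta = np.sign(beta) * np.maximum(np.abs(beta) - step * lam, 0.0)
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 5))
t = np.exp(-X[:, 0] + 0.5 * rng.standard_normal(200))  # feature 0 raises hazard
e = np.ones(200, dtype=int)                            # all events observed
bhat = lasso_cox(X, t, e, lam=0.02)
print(np.round(bhat, 2))
```

The fit correctly gives feature 0 the largest (positive) coefficient, since higher values shorten survival. The real package solves this along a full regularization path with screening, so it scales to biobank-sized designs.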

Journal ArticleDOI
TL;DR: Monitoring of peanut-specific CD4+ T cells with MHC-peptide Dextramers over the course of OIT revealed a transient increase in TGFβ-producing cells at 52 weeks in those with successful desensitization; single-cell TCRαβ repertoire sequences were too diverse to track clones over time.

Posted ContentDOI
31 May 2020-bioRxiv
TL;DR: A novel computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size is proposed.
Abstract: The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996), since its first proposal in statistics, has proven to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the l1-penalized linear model, logistic regression, the Cox model, and also extends to the elastic net with l1/l2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

Journal ArticleDOI
TL;DR: A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations; MDFs are widely used in statistics, for example for abundance data in ecology.
Abstract: A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. The use of MDFs is widespread in statistics and the applications are numerous, from abundance data in ecology ...

Posted ContentDOI
21 Jan 2020-bioRxiv
TL;DR: A scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L1-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso method developed in Qian et al. (2019).
Abstract: We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the L1-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian et al. (2019). The output of our algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow et al., 2015). Our approach, which we refer to as snpnet-Cox, is implemented in a publicly available package.

Posted ContentDOI
31 May 2020-bioRxiv
TL;DR: This work proposes a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables the method to cope with practical issues such as the inclusion of confounding variables and imputation of missing values among the phenotypes.
Abstract: In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use an iterative algorithm that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component, we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller sub-problems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present multiSnpnet package, available at http://github.com/junyangq/multiSnpnet that works on top of PLINK2 files, which we anticipate to be a valuable tool for generating polygenic risk scores from human genetic studies.

Posted Content
TL;DR: A novel matching distance derived from proximity scores in random forests is introduced and a match‐then‐split principle is proposed for the assessment with cross‐validation of the accuracy of heterogeneous treatment effect estimation.
Abstract: We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach on synthetic data and data generated from a real dataset.

Posted Content
TL;DR: The feature-weighted elastic net ("fwelnet") is proposed, which uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty to improve prediction.
Abstract: In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
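The idea of differential penalties driven by side information can be mimicked with a standard lasso by rescaling columns: penalizing gamma_j = w_j * beta_j uniformly is equivalent to penalizing beta_j with weight w_j. The sketch below uses made-up data and fixed weights; fwelnet itself learns the weights from the "features of features" rather than taking them as given.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 150, 30
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 1.0        # true signal lives in features 0-4
y = X @ beta + rng.standard_normal(n)

# Side information: features 0-9 are believed relevant -> smaller penalty.
w = np.where(np.arange(p) < 10, 0.5, 2.0)  # per-feature penalty weights
fit = Lasso(alpha=0.1).fit(X / w, y)       # uniform penalty on gamma = w * beta
coef = fit.coef_ / w                       # map back to the original scale
print(np.flatnonzero(np.abs(coef) > 0.1))
```

Features with small weights are penalized less and survive selection more easily, which is the behavior the weighted penalty is meant to encode.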

Posted ContentDOI
20 Jun 2020-medRxiv
TL;DR: This study provides insight into the association of social distancing with community mortality while accounting for key community factors; declines in mobility were associated with up to 15% lower mortality rates relative to pre-social distancing levels of mobility.
Abstract: The United States has become an epicenter for the coronavirus disease 2019 (COVID-19) pandemic. However, communities have been unequally affected and evidence is growing that social determinants of health may be exacerbating the pandemic. Furthermore, the impact and timing of social distancing at the community level have yet to be fully explored. We investigated the relative associations between COVID-19 mortality and social distancing, sociodemographic makeup, economic vulnerabilities, and comorbidities in 24 counties surrounding 7 major metropolitan areas in the US using a flexible and robust time series modeling approach. We found that counties with poorer health and less wealth were associated with higher daily mortality rates compared to counties with fewer economic vulnerabilities and fewer pre-existing health conditions. Declines in mobility were associated with up to 15% lower mortality rates relative to pre-social distancing levels of mobility, but effects were lagged between 25-30 days. While we cannot estimate causal impact, this study provides insight into the association of social distancing with community mortality while accounting for key community factors. For full transparency and reproducibility, we provide all data and code used in this study. One-sentence summary: County-level disparities in COVID-19 mortality highlight inequalities in socioeconomic and community factors and delayed effects of social distancing.

Journal ArticleDOI
TL;DR: This procedure deletes each chosen predictor and refits the lasso to get a set of models that are “close” to the chosen “base model,” and compares the error rates of the base model with that of nearby models.
Abstract: We propose a simple method for evaluating the model that has been chosen by an adaptive regression procedure, our main focus being the lasso. This procedure deletes each chosen predictor and refits the lasso to get a set of models that are "close" to the one chosen, referred to as the "base model". If the deletion of a predictor leads to a significant deterioration in the model's predictive power, the predictor is called indispensable; otherwise, the nearby model is called acceptable and can serve as a good alternative to the base model. This provides both an assessment of the predictive contribution of each variable and a set of alternative models that may be used in place of the chosen model. In this paper, we focus on the cross-validation (CV) setting: a model's predictive power is measured by its CV error, with the base model tuned by cross-validation. We propose a method for comparing the error rates of the base model with those of nearby models, and a p-value for testing whether a predictor is dispensable. We also propose a new quantity called the model score, which works similarly to the p-value for the control of type I error. Our proposal is closely related to the LOCO (leave-one-covariate-out) method of Rinaldo et al. (2016) and, less so, to Stability Selection (Meinshausen and Buhlmann, 2010). We call this procedure "Next-Door analysis" since it examines models close to the base model. It can be applied to Gaussian regression data, generalized linear models, and other supervised learning problems with l1 penalization. It could also be applied to best subset and stepwise regression procedures. We have implemented it in the R language as a library to accompany the well-known glmnet library.
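The Next-Door loop is straightforward to sketch with generic tools. The toy below uses scikit-learn rather than the authors' glmnet-based R package, and the 1.2 threshold is an arbitrary illustration, not the paper's p-value or model-score machinery: drop each predictor chosen by the base lasso, refit, and compare cross-validated error against the base model.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n, p = 120, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.standard_normal(n)

base = LassoCV(cv=5).fit(X, y)                 # base model, tuned by CV
chosen = np.flatnonzero(base.coef_)
base_cv = -cross_val_score(Lasso(alpha=base.alpha_), X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()

results = {}
for j in chosen:                               # delete each chosen predictor
    keep = np.delete(np.arange(p), j)
    cv_j = -cross_val_score(Lasso(alpha=base.alpha_), X[:, keep], y,
                            scoring="neg_mean_squared_error", cv=5).mean()
    results[j] = cv_j
    label = "indispensable" if cv_j > 1.2 * base_cv else "acceptable"
    print(j, round(cv_j, 2), label)
```

Here deleting the strong predictor (feature 0) inflates the CV error badly, flagging it as indispensable, while deleting a weak one leaves a nearby acceptable model.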

Journal ArticleDOI
TL;DR: A multi‐stage algorithm, called reluctant generalised additive modelling (RGAM), that can fit sparse GAMs at scale and is guided by the principle that, if all else is equal, one should prefer a linear feature over a non‐linear feature.
Abstract: Sparse generalised additive models (GAMs) are an extension of sparse generalised linear models that allow a model's prediction to vary non‐linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modelling, we propose a multi‐stage algorithm, called reluctant generalised additive modelling (RGAM), that can fit sparse GAMs at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non‐linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.

Journal ArticleDOI
TL;DR: There is more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.
Abstract: Professor Efron has presented us with a thought‐provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.

Posted ContentDOI
02 Dec 2020-bioRxiv
TL;DR: A Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events is proposed and its efficacy is demonstrated through simulations and applications to UK Biobank data.
Abstract: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank dataset, where records for a large number of common and rare diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by the BASIL algorithm. We provide a software implementation of the proposed method and demonstrate its efficacy through simulations and applications to UK Biobank data.
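The workhorse of a proximal gradient method for a sparse-group penalty is its proximal operator, which for a single group composes element-wise soft-thresholding (the $\ell_1$ part) with group-wise shrinkage (the $\ell_2$ part). A minimal sketch follows; the function name and interface are illustrative, not the authors' implementation, and the acceleration and screening steps are omitted.

```python
import numpy as np

def sgl_prox(beta, lam1, lam2):
    """Prox of lam1*||b||_1 + lam2*||b||_2 for one group's coefficients."""
    # l1 part: element-wise soft-thresholding ...
    b = np.sign(beta) * np.maximum(np.abs(beta) - lam1, 0.0)
    # ... l2 part: shrink the whole group, possibly to exactly zero.
    nrm = np.linalg.norm(b)
    if nrm <= lam2:
        return np.zeros_like(b)
    return (1.0 - lam2 / nrm) * b

weak = sgl_prox(np.array([0.1, -0.05]), 0.2, 0.1)    # whole group zeroed
strong = sgl_prox(np.array([3.0, 4.0]), 1.0, 1.0)    # survives, shrunken
```

The two-step form is what makes the penalty "sparse-group": the first step can zero individual coefficients within a surviving group, while the second can zero an entire group at once.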

Posted Content
TL;DR: This work presents a method that exploits any existing relationship between illness severity and treatment effect, and searches for the "sweet spot", the contiguous range of illness severity where the estimated treatment benefit is maximized.
Abstract: Identifying heterogeneous treatment effects (HTEs) in randomized controlled trials is an important step toward understanding and acting on trial results. However, HTEs are often small and difficult to identify, and HTE modeling methods which are very general can suffer from low power. We present a method that exploits any existing relationship between illness severity and treatment effect, and identifies the "sweet spot", the contiguous range of illness severity where the estimated treatment benefit is maximized. We further compute a bias-corrected estimate of the conditional average treatment effect (CATE) in the sweet spot, and a $p$-value. Because we identify a single sweet spot and $p$-value, we believe our method to be straightforward to interpret and actionable: results from our method can inform future clinical trials and help clinicians make personalized treatment recommendations.
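An exhaustive scan over contiguous severity windows conveys the idea. This toy sketch on simulated data maximizes the raw treated-minus-control outcome difference over all windows of at least 20 severity-sorted patients; the paper's bias-corrected CATE estimate and $p$-value are not computed here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
severity = rng.uniform(0.0, 1.0, n)
treated = rng.integers(0, 2, n)
# True benefit only in a middle band of severity (the real sweet spot).
benefit = np.where((severity > 0.4) & (severity < 0.7), 1.0, 0.0)
y = benefit * treated + rng.normal(size=n)

order = np.argsort(severity)
s, t, yy = severity[order], treated[order], y[order]

best, best_win = -np.inf, None
# Scan every contiguous window of at least 20 severity-sorted patients.
for i in range(n - 20):
    for j in range(i + 20, n + 1):
        tr = t[i:j] == 1
        if 0 < tr.sum() < j - i:  # need both arms present in the window
            d = yy[i:j][tr].mean() - yy[i:j][~tr].mean()
            if d > best:
                best, best_win = d, (s[i], s[j - 1])
```

Because `best` is a maximum over many overlapping windows, it is biased upward; that optimism is exactly what the paper's bias correction addresses.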

Journal ArticleDOI
TL;DR: The approach for analyzing a randomized trial could help identify a potential sweet spot of accentuated treatment effect; discrepancies between crude and stratified analyses can be visualized with graphical displays and replicated with matched comparisons.

Posted ContentDOI
27 Feb 2020-bioRxiv
TL;DR: A machine learning platform, the immunological Elastic-Net (iEN), is introduced, which incorporates immunological knowledge directly into the predictive models, allowing for the inclusion of immune features with strong predictive power even if not consistent with prior knowledge.
Abstract: The dense network of interconnected cellular signaling responses quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. While high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, limited cohort size together with the high dimensionality of the data increases the possibility of false-positive discoveries and model overfitting. We introduce a machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive power even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictive power for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset.
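One simple way to fold prior knowledge into an elastic net is a per-feature penalty factor, as glmnet's `penalty.factor` provides. The sketch below is not the iEN formulation: the prior scores and their mapping to factors are hypothetical, and since scikit-learn's `ElasticNet` lacks a penalty-factor argument it uses the equivalent column-rescaling trick instead.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + rng.normal(size=n)

# Hypothetical prior scores in [0, 1]: 1 = consistent with immunological
# knowledge. Map to per-feature penalty factors >= 1, so favored features
# are penalized less but no feature is excluded outright.
prior = np.zeros(p)
prior[:5] = 1.0
penalty_factor = 2.0 - prior          # 1 for favored features, 2 otherwise

# A uniform penalty on the rescaled column X[:, j] / f_j is equivalent to
# penalizing the original coefficient by the factor f_j; divide the
# fitted coefficients back by f_j to return to the original scale.
fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X / penalty_factor, y)
coef = fit.coef_ / penalty_factor
```

Because the factors only down-weight rather than remove penalties, a feature inconsistent with prior knowledge can still enter the model if its signal is strong enough, matching the exploratory behavior the abstract describes.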