
Showing papers by "Robert Tibshirani" published in 2021


Book
29 Jul 2021
TL;DR: This book presents some of the most important modeling and prediction techniques, along with relevant applications, that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years.
Abstract: An Introduction to Statistical Learning provides an accessible overview of the field of statistical learning, an essential toolset for making sense of the vast and complex data sets that have emerged in fields ranging from biology to finance to marketing to astrophysics in the past twenty years. This book presents some of the most important modeling and prediction techniques, along with relevant applications. Topics include linear regression, classification, resampling methods, shrinkage approaches, tree-based methods, support vector machines, clustering, and more. Color graphics and real-world examples are used to illustrate the methods presented. Since the goal of this textbook is to facilitate the use of these statistical learning techniques by practitioners in science, industry, and other fields, each chapter contains a tutorial on implementing the analyses and methods presented in R, an extremely popular open source statistical software platform. Two of the authors co-wrote The Elements of Statistical Learning (Hastie, Tibshirani and Friedman, 2nd edition 2009), a popular reference book for statistics and machine learning researchers. An Introduction to Statistical Learning covers many of the same topics, but at a level accessible to a much broader audience. This book is targeted at statisticians and non-statisticians alike who wish to use cutting-edge statistical learning techniques to analyze their data. The text assumes only a previous course in linear regression and no knowledge of matrix algebra.
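In the spirit of the book's chapter-end R labs, here is a minimal sketch of one of the shrinkage approaches it covers: a cross-validated lasso fit with the glmnet package (data simulated purely for illustration).

```r
# Minimal sketch of a cross-validated lasso, in the style of the book's
# R labs. Assumes the glmnet package; the data are simulated.
library(glmnet)

set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)        # predictors
beta <- c(rep(2, 3), rep(0, p - 3))    # only 3 truly active features
y <- x %*% beta + rnorm(n)             # response

cvfit <- cv.glmnet(x, y, alpha = 1)    # alpha = 1 selects the lasso
coef(cvfit, s = "lambda.min")          # sparse coefficient estimates
```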

3,439 citations


Journal ArticleDOI
TL;DR: In this article, the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals) was evaluated; the results delineate the genetic basis of biomarkers and their causal influences on diseases, and improve genetic risk stratification for common diseases.
Abstract: Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.
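A 'multi-PRS' model of the kind described can be thought of as a regression of disease status on many PRS columns at once. The sketch below is a hedged toy version with simulated data; the variable names (prs1–prs35, disease) are placeholders, not the paper's pipeline.

```r
# Toy sketch of a 'multi-PRS' model: logistic regression of one disease
# on 35 biomarker PRSs simultaneously. All data are simulated; this
# illustrates the idea, not the paper's actual pipeline.
set.seed(1)
n <- 5000; k <- 35
prs <- matrix(rnorm(n * k), n, k,
              dimnames = list(NULL, paste0("prs", 1:k)))
logit <- -2 + prs %*% rnorm(k, sd = 0.2)      # toy disease liability
disease <- rbinom(n, 1, plogis(logit))

dat <- data.frame(disease = disease, prs)
fit_multi <- glm(disease ~ ., family = binomial, data = dat)
head(coef(summary(fit_multi)))                # inspect a few PRS weights
```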

262 citations


Journal ArticleDOI
01 Jul 2021
TL;DR: A key role for the chemokine CXCL9 is identified in age-related chronic inflammation, cardiac aging, adverse cardiac remodeling and poor vascular function, and a metric for multimorbidity is derived that can be utilized for the early detection of age-related clinical phenotypes.
Abstract: While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8–96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in human and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes. From the blood immunome of 1,001 individuals aged 8–96 years, the authors used deep learning to develop an inflammatory clock of aging (iAge) that tracks with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The main contributor to iAge is the chemokine CXCL9, which is shown to control endothelial cell senescence and function.

155 citations


Posted ContentDOI
Estee Y Cramer1, Evan L. Ray1, Velma K. Lopez2, Johannes Bracher3 +281 more · Institutions (53)
05 Feb 2021-medRxiv
TL;DR: In this paper, the authors systematically evaluated 23 models that regularly submitted forecasts of reported weekly incident COVID-19 mortality counts in the US, at the state and national level, to the COVID-19 Forecast Hub.
Abstract: Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. In 2020, the COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized hundreds of thousands of specific predictions from more than 50 different academic, industry, and independent research groups. This manuscript systematically evaluates 23 models that regularly submitted forecasts of reported weekly incident COVID-19 mortality counts in the US at the state and national level. One of these models was a multi-model ensemble that combined all available forecasts each week. The performance of individual models showed high variability across time, geospatial units, and forecast horizons. Half of the models evaluated showed better accuracy than a naive baseline model. In combining the forecasts from all teams, the ensemble showed the best overall probabilistic accuracy of any model. Forecast accuracy degraded as models made predictions farther into the future, with probabilistic accuracy at a 20-week horizon more than 5 times worse than when predicting at a 1-week horizon. This project underscores the role that collaboration and active coordination between governmental public health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks.
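The hub scores quantile-format forecasts with the weighted interval score (WIS); a core ingredient is the pinball (quantile) loss, sketched below. Averaging pinball loss over the submitted quantile levels is closely related to the WIS (this sketch assumes a forecast issued as a small set of quantiles).

```r
# Pinball (quantile) loss for scoring a probabilistic forecast that is
# submitted as quantiles; its average over quantile levels is closely
# related to the weighted interval score used by the Forecast Hub.
pinball <- function(y, q, tau) {
  # y: observed value; q: predicted tau-quantile; tau: quantile level
  ifelse(y >= q, tau * (y - q), (1 - tau) * (q - y))
}

taus <- c(0.025, 0.25, 0.5, 0.75, 0.975)  # example quantile levels
qhat <- c(80, 95, 100, 108, 130)          # one forecast's quantiles
obs  <- 112                               # realized weekly count
mean(pinball(obs, qhat, taus))            # average pinball score
```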

68 citations


Posted ContentDOI
25 Jun 2021-medRxiv
TL;DR: In this paper, the utility of auxiliary COVID-19 indicators is studied from a forecasting perspective. The authors focus on five indicators, derived from medical insurance claims data, web search queries, and online survey responses, and ask whether each indicator's inclusion in a simple model leads to improved predictive accuracy relative to a similar model excluding it.
Abstract: Reliable, short-term forecasts of traditional public health reporting streams (such as cases, hospitalizations, and deaths) are a key ingredient in effective public health decision-making during a pandemic. Since April 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity. This paper studies the utility of these indicators from a forecasting perspective. We focus on five indicators, derived from medical insurance claims data, web search queries, and online survey responses. For each indicator, we ask whether its inclusion in a simple model leads to improved predictive accuracy relative to a similar model excluding it. We consider both probabilistic forecasting of confirmed COVID-19 case rates and binary prediction of case "hotspots". Since the values of indicators (and case rates) are commonly revised over time, we take special care to ensure that the data provided to a forecaster is the version that would have been available at the time the forecast was made. Our analysis shows that consistent but modest gains in predictive accuracy are obtained by using these indicators, and furthermore, these gains are related to periods in which the auxiliary indicators behave as "leading indicators" of case rates.
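The paper's point about revisions can be made concrete with a small "as of" filter: when backtesting, keep only the data versions that had been issued by the forecast date. The data frame and columns below (time_value, issue, value) are hypothetical stand-ins for a revision-tracked indicator.

```r
# Sketch of vintage-aware filtering: reconstruct what a forecaster
# would have seen on forecast_date from a revision-tracked data frame
# with columns time_value (date measured), issue (date published), and
# value. Column names are illustrative assumptions.
as_of <- function(df, forecast_date) {
  df <- df[df$issue <= forecast_date, ]     # drop not-yet-issued revisions
  df <- df[order(df$time_value, df$issue), ]
  df[!duplicated(df$time_value, fromLast = TRUE), ]  # latest issue per date
}
```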

23 citations


Journal ArticleDOI
TL;DR: SparseSignatures, presented in this paper, incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets.
Abstract: Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.

18 citations


Journal ArticleDOI
TL;DR: A penalized Cox proportional hazards model is applied for left-truncated and right-censored survival data and implications of left truncation adjustment on bias and interpretation are assessed.
Abstract: High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases survival data are also left-truncated which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database to highlight the pitfalls of failing to account for left truncation in survival modeling.
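In R, left-truncated, right-censored data are encoded as (entry, exit] intervals with survival::Surv, and recent versions of glmnet accept the same object for a penalized Cox fit. The sketch below uses simulated data and illustrates the adjustment, not the paper's exact analysis.

```r
# Sketch of a penalized Cox model adjusted for left truncation via
# (start, stop] coding. Assumes the survival package and a recent
# glmnet release that supports start-stop Cox responses.
library(survival)
library(glmnet)

set.seed(1)
n <- 200; p <- 50
x     <- matrix(rnorm(n * p), n, p)
entry <- runif(n)                      # delayed-entry (truncation) times
exit  <- entry + rexp(n, rate = 0.5)   # observed exit times
event <- rbinom(n, 1, 0.7)             # 1 = event, 0 = right-censored

y   <- Surv(entry, exit, event)        # left truncation enters here
fit <- glmnet(x, y, family = "cox")    # penalized, truncation-adjusted
```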

17 citations


Journal ArticleDOI
01 Mar 2021-Allergy
TL;DR: Asthma prolongs intubation in COVID-19, race is associated with differences in airway inflammation in patients with asthma, and eosinophil responses occur during COVID-19 infections and coronavirus vaccination.
Abstract: REFERENCES
1. Carli G, Cecchi L, Stebbing J, Parronchi P, Farsi A. Is asthma protective against COVID-19? Allergy. 2021;76:866–936.
2. Mahdavinia M, Foster KJ, Jauregui E, et al. Asthma prolongs intubation in COVID-19. J Allergy Clin Immunol Pract. 2020;8(7):2388-2391.
3. Hooper MW, Nápoles AM, Pérez-Stable EJ. COVID-19 and racial/ethnic disparities. JAMA. 2020;323(24):2466-2467.
4. Noonan AS, Velasco-Mondragon HE, Wagner FA. Improving the health of African Americans in the USA: an overdue opportunity for social justice. Public Health Rev. 2016;37(1):1-20.
5. Nyenhuis SM, Krishnan JA, Berry A, et al. Race is associated with differences in airway inflammation in patients with asthma. J Allergy Clin Immunol. 2017;140(1):257-265.e211.
6. Lindsley AW, Schwartz JT, Rothenberg ME. Eosinophil responses during COVID-19 infections and coronavirus vaccination. J Allergy Clin Immunol. 2020;146(1):1-7.

15 citations


Posted Content
TL;DR: This paper showed that the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level because each data point is used for both training and testing, and so the usual estimate of variance is too small.
Abstract: Cross-validation is a widely-used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallows' Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. Lastly, our analysis also shows that when producing confidence intervals for prediction accuracy with simple data splitting, one should not re-fit the model on the combined data, since this invalidates the confidence intervals.
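The naive interval whose undercoverage the paper documents is easy to write down: treat the fold-level errors as independent and use their standard error. A hedged sketch with simulated data:

```r
# The naive CV confidence interval critiqued above: fold errors are
# treated as independent even though data reuse across folds makes them
# correlated, so this variance estimate tends to be too small.
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 5), n, 5)
y <- x[, 1] + rnorm(n)
folds <- sample(rep(1:10, length.out = n))

errs <- sapply(1:10, function(k) {
  fit  <- lm(y[folds != k] ~ x[folds != k, ])
  pred <- cbind(1, x[folds == k, ]) %*% coef(fit)
  mean((y[folds == k] - pred)^2)       # held-out MSE for fold k
})
est <- mean(errs)
se  <- sd(errs) / sqrt(10)             # naive SE: independence assumed
c(est - 2 * se, est + 2 * se)          # interval that can undercover
```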

13 citations


Journal ArticleDOI
TL;DR: In this paper, a latent factor model of genetic risk based on components from the Decomposition of Genetic Associations (DeGAs) is introduced, yielding the DeGAs polygenic risk score (dPRS).
Abstract: Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual's genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few "outlier" individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.
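The decomposition can be pictured as a truncated SVD of a variants-by-traits matrix of association statistics, with an individual's genetic risk re-expressed through the components. The sketch below uses random placeholder matrices, not real summary statistics, and compresses the method to its linear-algebra core.

```r
# Hedged sketch of the DeGAs/dPRS idea: truncated SVD of a
# (variants x traits) effect matrix; an individual's PRS for one trait
# is then decomposed into per-component contributions. All matrices
# here are random placeholders.
set.seed(1)
B <- matrix(rnorm(1000 * 20), 1000, 20)     # variants x traits effects
s <- svd(B, nu = 5, nv = 5)                 # keep 5 latent components

G <- matrix(rbinom(3 * 1000, 2, 0.3), 3)    # 3 people x 1000 genotypes
comp_scores <- G %*% s$u                    # per-component genetic scores

trait <- 1
contrib <- sweep(comp_scores, 2, s$d[1:5] * s$v[trait, ], `*`)
rowSums(contrib)   # rank-5 PRS for trait 1; columns decompose it
```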

13 citations


Journal ArticleDOI
TL;DR: In this article, the MasSpec Pen technology integrated to electrospray ionization (ESI) was used for direct analysis of clinical swabs and investigate its use for COVID-19 screening.
Abstract: The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI mass spectrometry (MS) allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.

Journal Article
TL;DR: LassoNet, as discussed in this paper, enforces a hierarchy: a feature can participate in a hidden unit only if its linear representative is active. It integrates feature selection directly with parameter learning and, as a result, delivers an entire regularization path of solutions with a range of feature sparsity.
Abstract: Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. On systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
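The hierarchy can be illustrated with a toy projection: feature j's first-layer weights are bounded by a multiple of its skip-connection weight, so a feature with an inactive linear term is excluded from the hidden units. This is a simplified stand-in for the paper's hierarchical proximal step, not the exact operator.

```r
# Toy illustration of the LassoNet hierarchy constraint
# |W[, j]| <= M * |theta[j]|: if the linear (skip) weight theta[j] is
# zero, feature j's first-layer weights are forced to zero too.
# Simplified stand-in for the paper's proximal operator.
enforce_hierarchy <- function(W, theta, M = 10) {
  bound <- M * abs(theta)                          # per-feature budget
  sign(W) * pmin(abs(W), rep(bound, each = nrow(W)))
}

W <- matrix(rnorm(8), nrow = 2)   # 2 hidden units x 4 features
theta <- c(1, 0, 0.5, 0)          # skip weights; features 2, 4 inactive
enforce_hierarchy(W, theta)       # columns 2 and 4 become exactly zero
```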

Posted ContentDOI
06 Sep 2021-medRxiv
TL;DR: In this article, a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank is presented.
Abstract: We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,600 traits using genetic and phenotype data in the UK Biobank. We report 428 sparse PRS models with significant (p < 2.5 x 10^-5) incremental predictive performance when compared against the covariate-only model that considers age, sex, and the genotype principal components. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance in quantitative traits (Spearman's ρ = 0.54, p = 1.4 x 10^-15), but not in binary traits (ρ = 0.059, p = 0.35). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).

Journal ArticleDOI
TL;DR: The Stanford Prostate Cancer Calculator (SPCC) as mentioned in this paper combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds.
Abstract: Background While multiparametric MRI (mpMRI) has high sensitivity for detection of clinically significant prostate cancer (CSC), false positives and negatives remain common. Calculators that combine mpMRI with clinical variables can improve cancer risk assessment, while providing more accurate predictions for individual patients. We sought to create and externally validate nomograms incorporating Prostate Imaging Reporting and Data System (PIRADS) scores and clinical data to predict the presence of CSC in men of all biopsy backgrounds. Methods Data from 2125 men undergoing mpMRI and MR fusion biopsy from 2014 to 2018 at Stanford, Yale, and UAB were prospectively collected. Clinical data included age, race, PSA, biopsy status, PIRADS scores, and prostate volume. A nomogram predicting detection of CSC on targeted or systematic biopsy was created. Results Biopsy history, Prostate Specific Antigen (PSA) density, PIRADS score of 4 or 5, Caucasian race, and age were significant independent predictors. Our nomogram—the Stanford Prostate Cancer Calculator (SPCC)—combined these factors in a logistic regression to provide stronger predictive accuracy than PSA density or PIRADS alone. Validation of the SPCC using data from Yale and UAB yielded robust AUC values. Conclusions The SPCC combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds. The SPCC demonstrates strong external generalizability with successful validation in two separate institutions. The calculator is available as a free web-based tool that can direct real-time clinical decision-making.
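A nomogram of this kind is, at its core, a logistic regression on the listed predictors. The sketch below simulates a toy cohort with hypothetical variable names; it illustrates the model form, not the SPCC's fitted coefficients.

```r
# Hedged sketch of a nomogram-style risk model: logistic regression of
# clinically significant cancer (csc) on SPCC-like predictors. The
# cohort and variable names are simulated placeholders.
set.seed(1)
n <- 500
cohort <- data.frame(
  age          = rnorm(n, 65, 7),
  race         = factor(sample(c("Caucasian", "Other"), n, TRUE)),
  psa_density  = rexp(n, 10),
  pirads_45    = rbinom(n, 1, 0.4),   # PIRADS score of 4 or 5
  prior_biopsy = rbinom(n, 1, 0.5)
)
cohort$csc <- rbinom(n, 1, plogis(-3 + 5 * cohort$psa_density +
                                  1.2 * cohort$pirads_45))

fit <- glm(csc ~ age + race + psa_density + pirads_45 + prior_biopsy,
           family = binomial, data = cohort)
predict(fit, newdata = cohort[1, ], type = "response")  # per-patient risk
```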

Journal ArticleDOI
TL;DR: In this article, Ravi et al. developed two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals.
Abstract: Motivation: Large-scale and high-dimensional genome sequencing data pose computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. Results: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of the compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. Availability: https://github.com/rivas-lab/snpnet/tree/compact.
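The 2-bit idea can be demonstrated in a few lines: map {0, 1, 2, NA} to four 2-bit codes and pack four genotypes per byte. The packing layout below is illustrative, not snpnet-2.0's exact memory format.

```r
# Sketch of 2-bit genotype packing: codes {0, 1, 2, NA->3} occupy two
# bits each, so four genotypes fit in one byte -- a 32x saving over
# 8-byte doubles. Layout is illustrative, not snpnet-2.0's exact format.
pack_genotypes <- function(g) {
  code <- ifelse(is.na(g), 3L, as.integer(g))     # NA -> code 3
  length(code) <- 4 * ceiling(length(code) / 4)   # pad to multiple of 4
  code[is.na(code)] <- 0L                         # zero the padding
  m <- matrix(code, nrow = 4)                     # 4 genotypes per byte
  as.raw(m[1, ] + bitwShiftL(m[2, ], 2) +
         bitwShiftL(m[3, ], 4) + bitwShiftL(m[4, ], 6))
}

pack_genotypes(c(0, 1, 2, NA, 2, 2))   # 6 genotypes -> 2 bytes
```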


Posted ContentDOI
16 Feb 2021-bioRxiv
TL;DR: In this paper, the authors developed two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals.
Abstract: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces the memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce the memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of the compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve group Lasso problems on sparse genetic matrices with more than 1,000,000 columns and almost 100,000 rows within 10 minutes and using less than 32 GB of memory.


Posted ContentDOI
21 Jun 2021-medRxiv
TL;DR: This paper describes the design and implementation of an on-demand consultation service that derives evidence from patient data to answer clinician questions and support bedside decision making; the tools and methods developed are made publicly available to facilitate broad adoption of such services by health systems and academic medical centers.
Abstract: Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. As a result of advances in technical infrastructure, statistical analysis methods, and the availability of patient data at scale, an implementation of this vision is now possible. Motivated by these advances, and the information needs of clinicians in our academic medical center, we offered an on-demand consultation service to derive evidence from patient data to answer clinician questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. Consultation results informed individual patient care, resulted in changes to institutional practices, and motivated further clinical research. We make the tools and methods developed to implement the service publicly available to facilitate the broad adoption of such services by health systems and academic medical centers.

Journal ArticleDOI
TL;DR: MassExplorer, presented in this paper, is a tool to pre-process DESI-MSI data, visualize raw data, build predictive models using the statistical lasso approach to select for a sparse set of significant molecular changes, and interpret selected metabolites.
Abstract: Summary: In the last few years, desorption electrospray ionization mass spectrometry imaging (DESI-MSI) has been increasingly used for simultaneous detection of thousands of metabolites and lipids from human tissues and biofluids. Successfully finding the most significant differences between two sets of DESI-MSI data (e.g., healthy vs disease) requires the application of accurate computational and statistical methods that can pre-process the data under various normalization settings and help identify these changes among thousands of detected metabolites. Here, we report MassExplorer, a novel computational tool, to help pre-process DESI-MSI data, visualize raw data, build predictive models using the statistical lasso approach to select for a sparse set of significant molecular changes, and interpret selected metabolites. This tool, which is available for both online and offline use, is flexible for chemists, biologists, and statisticians, as it helps in visualizing the structure of DESI-MSI data and in analyzing the statistically significant metabolites that are differentially expressed across both sample types. Based on the modules in MassExplorer, we expect it to be immediately useful for various biological and chemical applications in mass spectrometry. Availability and implementation: MassExplorer is available as an online R-Shiny application or Mac OS X compatible standalone application. The application, sample performance, source code and corresponding guide can be found at: https://zarelab.com/research/massexplorer-a-tool-to-help-guide-analysis-of-mass-spectrometry-samples/. Supplementary information: Supplementary data are available at Bioinformatics online.


Posted ContentDOI
24 Jul 2021-bioRxiv
TL;DR: In this article, a multiscale, integrated profiling of Ductal Carcinoma in situ (DCIS) with clinical outcomes was performed by analyzing 677 DCIS samples from 481 patients with 7.1 years median follow-up from the Translational Breast Cancer Research Consortium (TBCRC) 038 study and the Resource of Archival Breast Tissue (RAHBT) cohorts.
Abstract: SUMMARY: Ductal carcinoma in situ (DCIS) is the most common precursor of invasive breast cancer (IBC), with variable propensity for progression. We have performed the first multiscale, integrated profiling of DCIS with clinical outcomes by analyzing 677 DCIS samples from 481 patients with 7.1 years median follow-up from the Translational Breast Cancer Research Consortium (TBCRC) 038 study and the Resource of Archival Breast Tissue (RAHBT) cohorts. We made observations on DNA, RNA, and protein expression, and generated a de novo clustering scheme for DCIS that represents a fundamental transcriptomic organization at this early stage of breast neoplasia. Distinct stromal expression patterns and immune cell compositions were identified. We found RNA expression patterns that correlate with later events. Our multiscale approach employed in situ methods to generate a spatially resolved atlas of breast precancers, where complementary modalities can be directly compared and correlated with conventional pathology findings, disease states, and clinical outcome. HIGHLIGHTS: A new transcriptomic classification solution reveals 3 major subgroups in DCIS. Four stroma-specific signatures are identified. Outcome analysis identifies pathways involved in DCIS progression. CNAs characterize high-risk-of-distant-relapse IBC subtypes observed in DCIS.

Posted ContentDOI
16 Jul 2021-medRxiv
TL;DR: The COVIDcast API as mentioned in this paper provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from de-identified medical claims data, massive online surveys, cell phone mobility data, and internet search trends.
Abstract: The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from de-identified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data is available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.
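For readers who want to pull these signals, the accompanying R package exposes the API directly; the call below reflects the package's documented interface at the time of writing (signal names and arguments should be treated as assumptions if the API has since evolved).

```r
# Sketch of fetching a revision-aware signal with the covidcast R
# package. The as_of argument reconstructs the data exactly as it was
# known on a given date, matching the API's revision tracking.
library(covidcast)

cases <- covidcast_signal(
  data_source = "jhu-csse",
  signal      = "confirmed_7dav_incidence_prop",  # smoothed case rates
  start_day   = "2021-01-01", end_day = "2021-03-01",
  geo_type    = "county",
  as_of       = "2021-03-07"
)
```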

Journal ArticleDOI
TL;DR: In this paper, a sparse-group regularized Cox regression method was proposed to improve the prediction performance of large-scale and high-dimensional survival data with few observed events; the approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response.
Abstract: MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only few uncensored events in the training data. RESULTS: We propose a sparse-group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare-event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset, where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: In this article, a matching distance derived from proximity scores in random forests is proposed to assess the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard prediction errors is not applicable.
Abstract: We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
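A hedged sketch of the first ingredient, the proximity-based matching distance, with greedy nearest-neighbor matching standing in for the paper's minimum-cost-flow formulation:

```r
# Sketch: random-forest proximity as a matching distance for HTE
# assessment. Greedy nearest-neighbor matching below is a simplified
# stand-in for the paper's average minimum-cost flow algorithm.
library(randomForest)

set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), n, 4)
w <- rbinom(n, 1, 0.5)                      # treatment indicator
y <- x[, 1] + w * (1 + x[, 2]) + rnorm(n)   # heterogeneous effect

rf <- randomForest(x, y, proximity = TRUE)
d  <- 1 - rf$proximity                      # matching distance
treated <- which(w == 1); control <- which(w == 0)
match_id   <- control[apply(d[treated, control], 1, which.min)]
pseudo_hte <- y[treated] - y[match_id]      # pseudo-observations of HTE
```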

Posted ContentDOI
19 May 2021-medRxiv
TL;DR: In this article, the MasSpec Pen technology integrated to electrospray ionization (ESI) was used for direct analysis of clinical swabs and investigate its use for COVID-19 screening.
Abstract: The outbreak of COVID-19 has created an unprecedented global crisis. While PCR is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and a validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI MS allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.

Journal ArticleDOI
TL;DR: This publisher correction notes that the article "The Bootstrap Method for Assessing Statistical Accuracy", by Bradley Efron and Robert Tibshirani, was originally published Online First without Open Access.
Abstract: The article The Bootstrap Method for Assessing Statistical Accuracy, written by Bradley Efron and Robert Tibshirani, was originally published Online First without Open Access.


Posted ContentDOI
12 Feb 2021-medRxiv
TL;DR: In this paper, a penalized Cox proportional hazards model for left-truncated and right-censored survival data was applied to assess the implications of left truncation adjustment on bias and interpretation.
Abstract: High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records (EHRs), and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases survival data is also left-truncated which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database (CGDB) to highlight the pitfalls of failing to account for left truncation in survival modeling.

Posted Content
TL;DR: In this paper, a method for casting survival analysis problems as classification problems is presented, thereby allowing the use of general classification methods and software in a survival setting and lowering the barrier to flexible modeling of right-censored data.
Abstract: While there are many well-developed data science methods for classification and regression, there are relatively few methods for working with right-censored data. Here, we present "survival stacking": a method for casting survival analysis problems as classification problems, thereby allowing the use of general classification methods and software in a survival setting. Inspired by the Cox partial likelihood, survival stacking collects features and outcomes of survival data in a large data frame with a binary outcome. We show that survival stacking with logistic regression is approximately equivalent to the Cox proportional hazards model. We further recommend methods for evaluating model performance in the survival stacked setting, and we illustrate survival stacking on real and simulated data. By reframing survival problems as classification problems, we make it possible for data scientists to use well-known learning algorithms (including random forests, gradient boosting machines and neural networks) in a survival setting, and lower the barrier for flexible survival modeling.
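The stacking construction itself is short: at each event time, emit one row per at-risk subject with a binary indicator of failure at that time; logistic regression on the stacked frame then approximates the Cox model. A minimal sketch with simulated data (ignoring ties and efficiency concerns):

```r
# Minimal survival-stacking sketch: build the risk-set-expanded binary
# data frame, then fit logistic regression. Coefficients on x1 and x2
# approximate the Cox log hazard ratios.
stack_survival <- function(time, status, x) {
  do.call(rbind, lapply(which(status == 1), function(i) {
    at_risk <- which(time >= time[i])        # risk set at the i-th event
    data.frame(x[at_risk, , drop = FALSE],
               risk_time = time[i],
               failed    = as.integer(at_risk == i))
  }))
}

set.seed(1)
x <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
time <- rexp(50); status <- rbinom(50, 1, 0.7)

stacked <- stack_survival(time, status, x)
fit <- glm(failed ~ x1 + x2 + factor(risk_time),  # risk_time dummies play the role of the baseline hazard
           family = binomial, data = stacked)
coef(fit)[c("x1", "x2")]
```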