
Showing papers by "Robert Tibshirani published in 2018"


Journal ArticleDOI
TL;DR: Pretreatment ctDNA levels and molecular responses are independently prognostic of outcomes in aggressive lymphomas, and these risk factors could potentially guide future personalized risk-directed approaches.
Abstract: Purpose: Outcomes for patients with diffuse large B-cell lymphoma remain heterogeneous, with existing methods failing to consistently predict treatment failure. We examined the additional prognostic value of circulating tumor DNA (ctDNA) before and during therapy for predicting patient outcomes. Patients and Methods: We studied the dynamics of ctDNA from 217 patients treated at six centers, using a training and validation framework. We densely characterized early ctDNA dynamics during therapy using cancer personalized profiling by deep sequencing to define response-associated thresholds within a discovery set. These thresholds were assessed in two independent validation sets. Finally, we assessed the prognostic value of ctDNA in the context of established risk factors, including the International Prognostic Index and interim positron emission tomography/computed tomography scans. Results: Before therapy, ctDNA was detectable in 98% of patients; pretreatment levels were prognostic in both front-line and salvage settings…

286 citations


Journal ArticleDOI
08 Jun 2018-Science
TL;DR: In pilot studies of pregnant women, RNA-based tests of maternal blood predicted delivery date and risk of early childbirth and hold promise for prenatal care in both the developed and developing worlds, although they require validation in larger, blinded clinical trials.
Abstract: Noninvasive blood tests that provide information about fetal development and gestational age could potentially improve prenatal care. Ultrasound, the current gold standard, is not always affordable in low-resource settings and does not predict spontaneous preterm birth, a leading cause of infant death. In a pilot study of 31 healthy pregnant women, we found that measurement of nine cell-free RNA (cfRNA) transcripts in maternal blood predicted gestational age with comparable accuracy to ultrasound but at substantially lower cost. In a related study of 38 women (23 full-term and 15 preterm deliveries), all at elevated risk of delivering preterm, we identified seven cfRNA transcripts that accurately classified women who delivered preterm up to 2 months in advance of labor. These tests hold promise for prenatal care in both the developed and developing worlds, although they require validation in larger, blinded clinical trials.

178 citations


Journal ArticleDOI
TL;DR: This work proposes and analyzes 3 methods for estimating heterogeneous treatment effects using observational data, and presents results of applying the 2 most promising methods to data from the SPRINT Data Analysis Challenge, based on a large randomized trial of a treatment for high blood pressure.
Abstract: When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from the SPRINT Data Analysis Challenge, based on a large randomized trial of a treatment for high blood pressure.
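The entry above describes its estimators in prose only; as a rough illustration of heterogeneous-treatment-effect estimation in general, here is a simple "T-learner" on simulated data (not one of the paper's three methods; all variable names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Simulated observational-style data: X covariates, w treatment, y outcome.
n, p = 2000, 10
X = rng.normal(size=(n, p))
w = rng.binomial(1, 0.5, size=n)                  # treatment indicator
tau = 0.5 * X[:, 0]                               # true heterogeneous effect
y = X[:, 1] + tau * w + rng.normal(size=n)

# T-learner: fit one outcome model per treatment arm, then difference.
m1 = GradientBoostingRegressor().fit(X[w == 1], y[w == 1])
m0 = GradientBoostingRegressor().fit(X[w == 0], y[w == 0])
tau_hat = m1.predict(X) - m0.predict(X)           # estimated individual effects

print(np.corrcoef(tau_hat, tau)[0, 1])            # how well tau is recovered
```

More sophisticated estimators differ mainly in how they share information between arms and guard against confounding in observational data.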

135 citations


Journal ArticleDOI
TL;DR: This model not only significantly increased predictive power by combining all datasets, but also revealed novel interactions between different biological modalities, providing a framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia.
Abstract: Motivation: Multiple biological clocks govern a healthy pregnancy. These biological mechanisms produce immunologic, metabolomic, proteomic, genomic and microbiomic adaptations during the course of pregnancy. Modeling the chronology of these adaptations during full-term pregnancy provides a framework for future studies examining deviations implicated in pregnancy-related pathologies, including preterm birth and preeclampsia.

113 citations


Journal ArticleDOI
TL;DR: A single-cell-based study of B cell precursor acute lymphoblastic leukemia at diagnosis is reported that reveals hidden developmentally dependent cell signaling states that are uniquely associated with relapse.
Abstract: Insight into the cancer cell populations that are responsible for relapsed disease is needed to improve outcomes. Here we report a single-cell-based study of B cell precursor acute lymphoblastic leukemia at diagnosis that reveals hidden developmentally dependent cell signaling states that are uniquely associated with relapse. By using mass cytometry we simultaneously quantified 35 proteins involved in B cell development in 60 primary diagnostic samples. Each leukemia cell was then matched to its nearest healthy B cell population by a developmental classifier that operated at the single-cell level. Machine learning identified six features of expanded leukemic populations that were sufficient to predict patient relapse at diagnosis. These features implicated the pro-BII subpopulation of B cells with activated mTOR signaling, and the pre-BI subpopulation of B cells with activated and unresponsive pre-B cell receptor signaling, to be associated with relapse. This model, termed 'developmentally dependent predictor of relapse' (DDPR), significantly improves currently established risk stratification methods. DDPR features exist at diagnosis and persist at relapse. By leveraging a data-driven approach, we demonstrate the predictive value of single-cell 'omics' for patient stratification in a translational setting and provide a framework for its application to human cancer.

97 citations


Journal ArticleDOI
TL;DR: This work presents a new method for post‐selection inference for ℓ1 (lasso)-penalized likelihood models, including generalized regression models, and presents applications of this work to (regularized) logistic regression, Cox's proportional hazards model, and the graphical lasso.
Abstract: We present a new method for post-selection inference for l1 (lasso)-penalized likelihood models, including generalized regression models. Our approach generalizes the post-selection framework presented in Lee et al. (2013). The method provides p-values and confidence intervals that are asymptotically valid, conditional on the inherent selection done by the lasso. We present applications of this work to (regularized) logistic regression, Cox's proportional hazards model and the graphical lasso. We do not provide rigorous proofs here of the claimed results, but rather conceptual and theoretical sketches.
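For context, the Lee et al. framework that this method generalizes conditions on the lasso's selection event, which is a polyhedron in the response vector; the notation below is a standard sketch of that idea, assumed here rather than quoted from the paper:

```latex
% Conditioning on the lasso's selected model M and sign vector s:
\[
  \{\, y : \hat{M}(y) = M,\ \hat{s}(y) = s \,\} \;=\; \{\, y : A y \le b \,\}.
\]
% Inference for a contrast \eta^\top \mu is based on \eta^\top y restricted
% to this polyhedron: a Gaussian truncated to an interval [V^-, V^+], which
% yields p-values and intervals valid conditional on the selection.
\[
  \eta^\top y \,\big|\, \{A y \le b\}
  \;\sim\; \mathrm{TN}\!\left(\eta^\top \mu,\ \sigma^2 \lVert \eta \rVert_2^2,\ [\mathcal{V}^-, \mathcal{V}^+]\right).
\]
```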

91 citations


Journal ArticleDOI
TL;DR: Forecast targets for the 2014-15 challenge were the onset week, peak week, and peak intensity of the season, as well as the weekly percent of outpatient visits due to influenza-like illness 1-4 weeks in advance; forecast skill varied by HHS region.

83 citations


Journal ArticleDOI
TL;DR: In this article, the authors show that the test statistic of Tibshirani et al. is asymptotically valid, as the number of samples grows and the dimension of the regression problem stays fixed.
Abstract: Recently, Tibshirani et al. [J. Amer. Statist. Assoc. 111 (2016) 600–620] proposed a method for making inferences about parameters defined by model selection, in a typical regression setting with normally distributed errors. Here, we study the large sample properties of this method, without assuming normality. We prove that the test statistic of Tibshirani et al. (2016) is asymptotically valid, as the number of samples $n$ grows and the dimension $d$ of the regression problem stays fixed. Our asymptotic result holds uniformly over a wide class of nonnormal error distributions. We also propose an efficient bootstrap version of this test that is provably (asymptotically) conservative, and in practice, often delivers shorter intervals than those from the original normality-based approach. Finally, we prove that the test statistic of Tibshirani et al. (2016) does not enjoy uniform validity in a high-dimensional setting, when the dimension $d$ is allowed to grow.

78 citations


Journal ArticleDOI
TL;DR: A multicenter study demonstrates that DESI-MSI is a robust and reproducible technology for rapid breast-cancer-tissue diagnosis and therefore is of value for clinical use.
Abstract: The histological and molecular subtypes of breast cancer demand distinct therapeutic approaches. Invasive ductal carcinoma (IDC) is subtyped according to estrogen-receptor (ER), progesterone-receptor (PR), and HER2 status, among other markers. Desorption-electrospray-ionization-mass-spectrometry imaging (DESI-MSI) is an ambient-ionization MS technique that has been previously used to diagnose IDC. Aiming to investigate the robustness of ambient-ionization MS for IDC diagnosis and subtyping over diverse patient populations and interlaboratory use, we report a multicenter study using DESI-MSI to analyze samples from 103 patients independently analyzed in the United States and Brazil. The lipid profiles of IDC and normal breast tissues were consistent across different patient races and were unrelated to country of sample collection. Similar experimental parameters used in both laboratories yielded consistent mass-spectral data in mass-to-charge ratios (m/z) above 700, where complex lipids are observed. Statistical…

65 citations


Journal ArticleDOI
TL;DR: The capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) is established to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin.
Abstract: Detection of microscopic skin lesions presents a considerable challenge in diagnosing early-stage malignancies as well as in residual tumor interrogation after surgical intervention. In this study, we established the capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin. We analyzed 86 human specimens collected during Mohs micrographic surgery for BCC to cross-examine spatial distributions of numerous lipids and metabolites in BCC aggregates versus adjacent skin. Statistical analysis using the least absolute shrinkage and selection operator (Lasso) was employed to categorize each 200-µm-diameter picture element (pixel) of the investigated skin tissue map as BCC or normal. Lasso identified 24 molecular ion signals, which are significant for pixel classification. These ion signals included lipids observed at m/z 200–1,200 and Krebs cycle metabolites observed at m/z …
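As a toy illustration of this kind of per-pixel Lasso classification (not the authors' pipeline; the data shapes, labels, and regularization strength below are hypothetical), an L1-penalized logistic regression selects a sparse set of m/z features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical per-pixel spectra: rows are pixels, columns are m/z bins.
n_pixels, n_mz = 500, 300
X = rng.lognormal(size=(n_pixels, n_mz))
y = rng.binomial(1, 0.5, size=n_pixels)           # 1 = BCC, 0 = normal (toy labels)

# Lasso-style classifier: the L1 penalty zeroes out uninformative m/z bins.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])
print(f"{selected.size} m/z bins selected out of {n_mz}")
```

In the real analysis, each pixel's spectrum would come from the DESI-MSI image and the labels from histopathology.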

61 citations


Journal ArticleDOI
TL;DR: The increasing prevalence of FA, the lack of robust biomarkers, and inadequate treatments warrant further research into the mechanisms underlying food allergies; parallel advances in bioinformatics and computational techniques have enabled the integration, analysis, and interpretation of exponentially growing data sets.
Abstract: Food allergy (FA) prevalence has been increasing over the last few decades and is now a global health concern. Current diagnostic methods for FA result in a high number of false-positive results, and the standard of care is either allergen avoidance or use of epinephrine on accidental exposure, with no other treatments currently approved. The increasing prevalence of FA, the lack of robust biomarkers, and inadequate treatments warrant further research into the mechanisms underlying food allergies. Recent technological advances have made it possible to move beyond traditional biological techniques to more sophisticated high-throughput approaches. These technologies have created the burgeoning field of omics sciences, which permit a more systematic investigation of biological problems. Omics sciences, such as genomics, epigenomics, transcriptomics, proteomics, metabolomics, microbiomics, and exposomics, have enabled the construction of regulatory networks and biological pathway models. Parallel advances in bioinformatics and computational techniques have enabled the integration, analysis, and interpretation of these exponentially growing data sets and open the possibility of personalized or precision medicine for FA.

Journal ArticleDOI
TL;DR: CCGA is a prospective multi-center observational study for development of a noninvasive cfDNA-based multi-cancer detection assay, aimed at detecting multiple cancers at early stages, when curative treatment is more likely to succeed.
Abstract: Background: Globally, most cancers are detected at advanced stages with high treatment burden and low cure rates. A noninvasive cfDNA blood test detecting multiple cancers at early stages, when curative treatment is more likely to succeed, is desirable. CCGA (NCT02889978) is a prospective multi-center observational study for development of a noninvasive cfDNA-based multi-cancer detection assay. Methods: Prospectively collected samples (N = 1627) from 749 controls (no cancer diagnosis, C) and 878 participants (pts) with newly diagnosed untreated cancer (20 tumor types, all stages) were analyzed in a preplanned substudy. Three prototype sequencing assays were performed: paired cfDNA and white blood cell (WBC, 60,000X) targeted sequencing (507 genes) for single-nucleotide variants/indels; paired cfDNA and WBC whole-genome sequencing (WGS, 30X) for copy-number variation; and cfDNA whole-genome bisulfite sequencing (WGBS, 30X) for methylation. For each assay, a detection model was developed for all cancer pts; sensitivity…

Journal ArticleDOI
TL;DR: Found In Translation (FIT) is presented, a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition, and predicted novel disease-associated genes.
Abstract: Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has yet to be systematically incorporated in the interpretation of animal models. Here we present Found In Translation (FIT; http://www.mouse2man.org), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20–50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost. The machine learning approach FIT leverages public mouse and human expression data to improve the translation of mouse model results to analogous human disease.

Journal ArticleDOI
TL;DR: It is shown that individual variation at complement factor H and age-related maculopathy susceptibility 2, genes which predispose to AMD, also determines the effectiveness of nutritional prophylaxis, and that its use should be based on patient-specific genotype.
Abstract: We evaluated the influence of an antioxidant and zinc nutritional supplement [the Age-Related Eye Disease Study (AREDS) formulation] on delaying or preventing progression to neovascular AMD (NV) in persons with age-related macular degeneration (AMD). AREDS subjects (n = 802) with category 3 or 4 AMD at baseline who had been treated with placebo or the AREDS formulation were evaluated for differences in the risk of progression to NV as a function of complement factor H (CFH) and age-related maculopathy susceptibility 2 (ARMS2) genotype groups. We used published genetic grouping: a two-SNP haplotype risk-calling algorithm to assess CFH, and either the single SNP rs10490924 or 372_815del443ins54 to mark ARMS2 risk. Progression risk was determined using the Cox proportional hazard model. Genetics-treatment interaction on NV risk was assessed using a multi-iterative bootstrap validation analysis. We identified strong interaction of genetics with AREDS formulation treatment on the development of NV. Individuals with high CFH and no ARMS2 risk alleles and taking the AREDS formulation had increased progression to NV compared with placebo. Those with low CFH risk and high ARMS2 risk had decreased progression risk. Analysis of CFH and ARMS2 genotype groups from a validation dataset reinforces this conclusion. Bootstrapping analysis confirms the presence of a genetics-treatment interaction and suggests that individual treatment response to the AREDS formulation is largely determined by genetics. The AREDS formulation modifies the risk of progression to NV based on individual genetics. Its use should be based on patient-specific genotype.
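The genetics-treatment interaction described above is, at its core, an interaction term in a Cox model; a minimal sketch (assuming the Python lifelines package and its formula API; the variable names and simulated effect sizes are hypothetical) looks like:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)

# Toy survival data: genotype risk score, treatment arm, and their interaction.
n = 1000
df = pd.DataFrame({
    "cfh_risk": rng.integers(0, 3, size=n),       # 0/1/2 risk alleles (toy)
    "treated": rng.binomial(1, 0.5, size=n),      # formulation vs placebo
})
hazard = np.exp(0.3 * df.cfh_risk - 0.2 * df.treated
                + 0.4 * df.cfh_risk * df.treated)  # built-in interaction
df["time"] = rng.exponential(1.0 / hazard)
df["event"] = 1                                    # no censoring in this toy

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event",
        formula="cfh_risk + treated + cfh_risk:treated")
cph.print_summary()
```

A significant coefficient on the interaction term is what would indicate that treatment response depends on genotype.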

Journal ArticleDOI
TL;DR: An approach is developed and applied that identifies top-ranking drug combinations based on the single-cell perturbation responses observed when an individual tumor sample is screened against a panel of single drugs.
Abstract: An individual malignant tumor is composed of a heterogeneous collection of single cells with distinct molecular and phenotypic features, a phenomenon termed intratumoral heterogeneity. Intratumoral heterogeneity poses challenges for cancer treatment, motivating the need for combination therapies. Single-cell technologies are now available to guide effective drug combinations by accounting for intratumoral heterogeneity through the analysis of the signaling perturbations of an individual tumor sample screened by a drug panel. In particular, Mass Cytometry Time-of-Flight (CyTOF) is a high-throughput single-cell technology that enables the simultaneous measurements of multiple (>40) intracellular and surface markers at the level of single cells for hundreds of thousands of cells in a sample. We developed a computational framework, entitled Drug Nested Effects Models (DRUG-NEM), to analyze CyTOF single-drug perturbation data for the purpose of individualizing drug combinations. DRUG-NEM optimizes drug combinations by choosing the minimum number of drugs that produce the maximal desired intracellular effects based on nested effects modeling. We demonstrate the performance of DRUG-NEM using single-cell drug perturbation data from tumor cell lines and primary leukemia samples.

Posted Content
TL;DR: This work provides a didactic framework that elucidates the relationships between the different approaches and compares them all using a variety of simulations of both randomized and observational data, and shows that researchers estimating heterogeneous treatment effects need not limit themselves to a single model-fitting algorithm.
Abstract: Practitioners in medicine, business, political science, and other fields are increasingly aware that decisions should be personalized to each patient, customer, or voter. A given treatment (e.g. a drug or advertisement) should be administered only to those who will respond most positively, and certainly not to those who will be harmed by it. Individual-level treatment effects can be estimated with tools adapted from machine learning, but different models can yield contradictory estimates. Unlike risk prediction models, however, treatment effect models cannot be easily evaluated against each other using a held-out test set because the true treatment effect itself is never directly observed. Besides outcome prediction accuracy, several metrics that can leverage held-out data to evaluate treatment effect models have been proposed, but they are not widely used. We provide a didactic framework that elucidates the relationships between the different approaches and compare them all using a variety of simulations of both randomized and observational data. Our results show that researchers estimating heterogeneous treatment effects need not limit themselves to a single model-fitting algorithm. Instead of relying on a single method, multiple models fit by a diverse set of algorithms should be evaluated against each other using an objective function learned from the validation set. The model minimizing that objective should be used for estimating the individual treatment effect for future individuals.

Posted Content
TL;DR: This work shows how to generate hypotheses in a strategic manner that sharply reduces the cost of data exploration and results in useful confidence intervals.
Abstract: Investigators often use the data to generate interesting hypotheses and then perform inference for the generated hypotheses. P-values and confidence intervals must account for this explorative data analysis. A fruitful method for doing so is to condition any inferences on the components of the data used to generate the hypotheses, thus preventing information in those components from being used again. Some currently popular methods "over-condition", leading to wide intervals. We show how to perform the minimal conditioning in a computationally tractable way. In high dimensions, even this minimal conditioning can lead to intervals that are too wide to be useful, suggesting that up to now the cost of hypothesis generation has been underestimated. We show how to generate hypotheses in a strategic manner that sharply reduces the cost of data exploration and results in useful confidence intervals. Our discussion focuses on the problem of post-selection inference after fitting a lasso regression model, but we also outline its extension to a much more general setting.

Posted ContentDOI
04 Aug 2018-bioRxiv
TL;DR: This work presents SparseSignatures, a novel framework to extract signatures from somatic point mutation data that incorporates DNA replication error as a background, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets.
Abstract: Cancer is the result of mutagenic processes that can be inferred from genome sequences by analysis of mutational signatures. Here we present SparseSignatures, a novel framework to extract mutational signatures from somatic point mutation data. Our approach incorporates DNA replication error as a background, enforces sparsity of non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to very large datasets. We apply SparseSignatures to whole genome sequences of 2827 tumors from 20 cancer types and show by standard metrics that our set of signatures is substantially more robust than previously reported ones, having eliminated redundancy and overfitting. Known mutagens (e.g., UV light, benzo(a)pyrene, APOBEC dysregulation) exhibit single signatures and occur in the expected tissues, a dominant signature with uncertain etiology is present in liver cancers, and other cancers exhibit a mixture of signatures or are dominated by background and CpG methylation signatures. Apart from cancers that are mostly due to environmental mutagens there is virtually no correlation between cancer types and signatures, highlighting the idea that any of several mutagenic pathways can be active in any solid tissue.
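SparseSignatures adds a fixed background signature, sparsity penalties, and cross-validation on top of the basic decomposition, but the underlying computation resembles non-negative matrix factorization of a mutation-count matrix. A bare-bones sketch (sklearn's NMF on simulated counts, with the signature count chosen by held-out reconstruction error; all sizes hypothetical):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(3)

# Toy mutation catalog: tumors x 96 trinucleotide mutation categories.
n_tumors, n_cat, k_true = 200, 96, 4
W_true = rng.gamma(2.0, size=(n_tumors, k_true))
H_true = rng.dirichlet(np.ones(n_cat), size=k_true)
M = rng.poisson(W_true @ H_true * 50)

# Choose the number of signatures by reconstruction error on held-out tumors.
train, test = M[:150], M[150:]
for k in range(2, 7):
    nmf = NMF(n_components=k, init="nndsvda", max_iter=500).fit(train)
    W_test = nmf.transform(test)                   # exposures for held-out tumors
    err = np.linalg.norm(test - W_test @ nmf.components_)
    print(k, round(err, 1))
```

The paper's framework differs by fixing one component to a replication-error background and penalizing the remaining signatures toward sparsity.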

Journal ArticleDOI
TL;DR: In this paper, the authors propose a prototype model to test for the presence of simultaneous group-wide signal, for groups in isolation or for multiple groups together, by creating group prototypes with reference to the response and then testing with likelihood ratio statistics that incorporate only these prototypes.
Abstract: Applied statistical problems often come with prespecified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Current tests for the presence of such signals include the classical F-test or a t-test on unsupervised group prototypes (either group centroids or first principal components). In this article, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes, with reference to the response, and then test with likelihood ratio statistics incorporating only these prototypes. We propose a model, called the “prototype model,” which naturally models this two-step procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction…

Posted Content
TL;DR: An estimation method that simultaneously recovers the main effects and interactions of a mixed data frame and is near optimal under conditions met in the targeted applications, together with an optimization algorithm that provably converges to an optimal solution.
Abstract: A mixed data frame (MDF) is a table collecting categorical, numerical and count observations. The use of MDFs is widespread in statistics and the applications are numerous, from abundance data in ecology to recommender systems. In many cases, an MDF exhibits simultaneously main effects, such as row, column or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is very substantial, with few exceptions, existing methods do not allow one to incorporate main effects and interactions while providing statistical guarantees. The present work fills this gap. We propose an estimation method which recovers the main effects and the interactions simultaneously. We show that our method is near optimal under conditions which are met in our targeted applications. We also propose an optimization algorithm which provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are significantly large compared to the interactions, and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network.
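In symbols, the kind of estimator described might be written as follows (notation assumed here, not taken from the abstract): with observed entries $\Omega$, column-appropriate losses $\ell_j$, sparse main-effect coefficients $\alpha$, and an interaction matrix $\Theta$,

```latex
% Sketch of a main-effects-plus-low-rank estimator for a mixed data frame Y:
\[
  \min_{\alpha,\,\Theta}\;
  \sum_{(i,j) \in \Omega} \ell_j\!\left(Y_{ij},\; X_{ij}^\top \alpha + \Theta_{ij}\right)
  \;+\; \lambda_1 \lVert \alpha \rVert_1
  \;+\; \lambda_2 \lVert \Theta \rVert_* ,
\]
% where \ell_j is an exponential-family loss matched to column j's type
% (Gaussian, Bernoulli, Poisson), \alpha encodes sparse main effects with
% covariates X_{ij}, and the nuclear norm keeps the interaction matrix
% \Theta low-rank.
```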

Journal ArticleDOI
TL;DR: In this article, the authors demonstrate analytic approaches for matched studies where two controls are linked to each case and events are accumulating counts rather than binary outcomes, and clarify the distinction between total risk and excess risk (unmatched vs. matched perspectives).

Posted Content
TL;DR: A new method for supervised learning that combines the lasso (ℓ1) penalty with a quadratic penalty that shrinks the coefficient vector toward the feature matrix's leading principal components (PCs).
Abstract: We propose a new method for supervised learning, especially suited to wide data where the number of features is much greater than the number of observations. The method combines the lasso ($\ell_1$) sparsity penalty with a quadratic penalty that shrinks the coefficient vector toward the leading principal components of the feature matrix. We call the proposed method the "principal components lasso" ("pcLasso"). The method can be especially powerful if the features are pre-assigned to groups (such as cell-pathways, assays or protein interaction networks). In that case, pcLasso shrinks each group-wise component of the solution toward the leading principal components of that group. In the process, it also carries out selection of the feature groups. We provide some theory for this method and illustrate it on a number of simulated and real data examples.
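A plausible way to write the objective the abstract describes (a sketch; the exact form and notation are assumed, not quoted): with SVD $X = UDV^\top$ and singular values $d_1 \ge d_2 \ge \cdots$,

```latex
% pcLasso-style objective (sketch):
\[
  \min_{\beta}\;
  \tfrac{1}{2} \lVert y - X\beta \rVert_2^2
  \;+\; \lambda \lVert \beta \rVert_1
  \;+\; \tfrac{\theta}{2}\, \beta^\top V \,
        \mathrm{diag}\!\left(d_1^2 - d_j^2\right) V^\top \beta .
\]
% The j-th coordinate of V^T beta is penalized in proportion to d_1^2 - d_j^2:
% zero along the leading principal component, increasing for trailing ones.
```

The quadratic term vanishes along the leading principal component and grows for trailing ones, which is what "shrinking toward the leading PCs" means; with pre-assigned feature groups, the same penalty is applied group-wise.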

Posted Content
TL;DR: A pliable lasso method is introduced for estimation of interaction effects in the Cox proportional hazards model framework, incorporating modifiers that are either fixed or varying in time into the partial likelihood.
Abstract: We introduce a pliable lasso method for estimation of interaction effects in the Cox proportional hazards model framework. The pliable lasso is a linear model that includes interactions between covariates X and a set of modifying variables Z and assumes sparsity of the main effects and interaction effects. The hierarchical penalty excludes interaction effects when the corresponding main effects are zero: this avoids overfitting and an explosion of model complexity. We extend this method to the Cox model for survival data, incorporating modifiers that are either fixed or varying in time into the partial likelihood. For example, this allows modeling of survival times that differ based on interactions of genes with age, gender, or other demographic information. The optimization is done by blockwise coordinate descent on a second order approximation of the objective.
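The pliable lasso's linear predictor has the following general shape (notation assumed; this is the linear-model form, which the paper then carries into the Cox partial likelihood):

```latex
% Pliable lasso linear predictor: main effects beta_j, modifiers theta_j.
\[
  \hat{\eta}(x, z) \;=\; \beta_0 \;+\; z^\top \theta_0
  \;+\; \sum_{j=1}^{p} x_j \left( \beta_j + z^\top \theta_j \right),
\]
% with a hierarchical penalty forcing \theta_j = 0 whenever \beta_j = 0.
% In the Cox extension, \hat{\eta} enters the partial likelihood as each
% subject's log relative hazard, and z may be fixed or time-varying.
```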

Journal ArticleDOI
TL;DR: The authors' method, nuclear penalized multinomial regression (NPMR), is applied to Major League Baseball play-by-play data to predict outcome probabilities based on batter–pitcher matchups and suggests a novel understanding of what differentiates players.
Abstract: We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The interpretation of the results meshes well with subject-area expertise and also suggests a novel understanding of what differentiates players.
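Schematically (notation assumed, not quoted from the paper), NPMR swaps the ridge penalty on the $p \times K$ coefficient matrix $B$ for the nuclear norm $\lVert B \rVert_*$, the convex surrogate for rank:

```latex
% Nuclear penalized multinomial regression: B_k denotes the k-th column of B.
\[
  \min_{\beta_0,\, B}\;
  -\sum_{i=1}^{n} \log
  \frac{\exp\!\left(\beta_{0,y_i} + x_i^\top B_{y_i}\right)}
       {\sum_{k=1}^{K} \exp\!\left(\beta_{0,k} + x_i^\top B_{k}\right)}
  \;+\; \lambda \lVert B \rVert_* .
\]
```

A low-rank $B$ forces the response categories to share a few latent directions in predictor space, which is the structure exploited in the batter-pitcher application.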

Proceedings ArticleDOI
TL;DR: At 99% specificity, a consistent “cancer-like” signal was observed for invasive cancer, supporting the promise of cfDNA assays for early cancer detection; the first learnings from multiple cfDNA assays in CCGA are reported here.
Abstract: CCGA [NCT02889978] is the largest study of cfDNA-based early cancer detection; the first CCGA learnings from multiple cfDNA assays are reported here. This prospective, multi-center, observational study has enrolled 10,012 of 15,000 demographically-balanced participants at 141 sites. Blood was collected from participants with newly diagnosed therapy-naive cancer (C, case) and participants without a diagnosis of cancer (noncancer [NC], control) as defined at enrollment. This preplanned substudy included 878 cases, 580 controls, and 169 assay controls (n=1627) across 20 tumor types and all clinical stages. All samples were analyzed by: 1) Paired cfDNA and white blood cell (WBC)-targeted sequencing (60,000X, 507 gene panel); a joint caller removed WBC-derived somatic variants and residual technical noise; 2) Paired cfDNA and WBC whole-genome sequencing (WGS; 35X); a novel machine learning algorithm generated cancer-related signal scores; joint analysis identified shared events; and 3) cfDNA whole-genome bisulfite sequencing (WGBS; 34X); normalized scores were generated using abnormally methylated fragments. In the targeted assay, non-tumor WBC-matched cfDNA somatic variants (SNVs/indels) accounted for 76% of all variants in NC and 65% in C. Consistent with somatic mosaicism (i.e., clonal hematopoiesis), WBC-matched variants increased with age; several were non-canonical loss-of-function mutations not previously reported. After WBC variant removal, canonical driver somatic variants were highly specific to C (e.g., in EGFR and PIK3CA, 0 NC had variants vs 11 and 30, respectively, of C). Similarly, of 8 NC with somatic copy number alterations (SCNAs) detected with WGS, 4 were derived from WBCs. WGBS data revealed informative hyper- and hypo-fragment level CpGs (1:2 ratio); a subset was used to calculate methylation scores. At 99% specificity, a consistent “cancer-like” signal was observed for invasive cancer; these results support the promise of cfDNA assays for early cancer detection. Additional data will be presented on detected plasma:tissue variant concordance and on multi-assay modeling. Citation Format: Alexander A. Aravanis, Geoffrey R. Oxnard, Tara Maddala, Earl Hubbell, Oliver Venn, Arash Jamshidi, Ling Shen, Hamed Amini, John A. Beausang, Craig Betts, Daniel Civello, Konstantin Davydov, Saniya Fazullina, Darya Filippova, Sante Gnerre, Samuel Gross, Chenlu Hou, Roger Jiang, Byoungsok Jung, Kathryn Kurtzman, Collin Melton, Shivani Nautiyal, Jonathan Newman, Joshua Newman, Cosmos Nicolaou, Richard Rava, Onur Sakarya, Ravi Vijaya Satya, Seyedmehdi Shojaee, Kristan Steffen, Anton Valouev, Hui Xu, Jeanne Yue, Nan Zhang, Jose Baselga, Rosanna Lapham, Daron G. Davis, David Smith, Donald Richards, Michael V. Seiden, Charles Swanton, Timothy J. Yeatman, Robert Tibshirani, Christina Curtis, Sylvia K. Plevritis, Richard Williams, Eric Klein, Anne-Renee Hartman, Minetta C. Liu. Development of plasma cell-free DNA (cfDNA) assays for early cancer detection: first insights from the Circulating Cell-Free Genome Atlas Study (CCGA) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr LB-343.

Journal ArticleDOI
TL;DR: The authors address their choice of clinical endpoint, the potential for multiple-testing false positives, and the need for additional study, as well as the rationale for using neovascular AMD (nvAMD) as the endpoint.
Abstract: Vickers (1) offers little substantive criticism, but we address three items he mentions: (i) our choice of clinical endpoint, (ii) the potential for multiple-testing false positives, and (iii) the need for additional study. An important distinction of our study (2) is the use of neovascular AMD (nvAMD) as the endpoint. In 2001, the Age-Related Eye Disease Study (AREDS) showed that nutritional supplements reduce progression to overall advanced AMD. This main effect was due to reduced progression to nvAMD, with no impact on progression to the geographic atrophy (GA) form of advanced AMD (3). As Vickers notes (1), Seddon et al. (4) confirmed this pharmacogenetic interaction. However, he misquotes or misunderstands their conclusion, which states that “similar results were seen for NV subtype but not GA” (4). Vickers notes that work by Awh et al. …

Journal ArticleDOI
TL;DR: A new method for supervised learning that fits a hub-based graphical model to the predictors to estimate the amount of “connection” each predictor has with the other predictors, yielding a set of predictor weights that are then used in a regularized regression such as the lasso or elastic net.
Abstract: We propose a new method for supervised learning. The hubNet procedure fits a hub-based graphical model to the predictors, to estimate the amount of “connection” that each predictor has with other predictors. This yields a set of predictor weights that are then used in a regularized regression such as the lasso or elastic net. The resulting procedure is easy to implement, can often yield higher or competitive prediction accuracy with fewer features than the lasso, and can give insight into the underlying structure of the predictors. HubNet can be generalized seamlessly to supervised problems such as regularized logistic regression (and other GLMs), Cox’s proportional hazards model, and nonlinear procedures such as random forests and boosting. We prove recovery results under a specialized model and illustrate the method on real and simulated data.

Keywords: hubNet; adaptive lasso; graphical model; unsupervised weights

Posted Content
TL;DR: In this paper, the authors propose a simple method for evaluating the model chosen by an adaptive regression procedure, their main focus being the lasso: each chosen predictor is deleted in turn and the lasso is refit, giving a set of models that are “close” to the chosen one.
Abstract: We propose a simple method for evaluating the model that has been chosen by an adaptive regression procedure, our main focus being the lasso. This procedure deletes each chosen predictor and refits the lasso to get a set of models that are "close" to the one chosen, referred to as "base model". If the deletion of a predictor leads to significant deterioration in the model's predictive power, the predictor is called indispensable; otherwise, the nearby model is called acceptable and can serve as a good alternative to the base model. This provides both an assessment of the predictive contribution of each variable and a set of alternative models that may be used in place of the chosen model. In this paper, we will focus on the cross-validation (CV) setting, where a model's predictive power is measured by its CV error, with the base model tuned by cross-validation. We propose a method for comparing the error rates of the base model with that of nearby models, and a p-value for testing whether a predictor is dispensable. We also propose a new quantity called the model score, which works similarly to the p-value for the control of type I error. Our proposal is closely related to the LOCO (leave-one-covariate-out) methods of Rinaldo et al. (2016) and, less so, to stability selection (Meinshausen and Bühlmann, 2010). We call this procedure "Next-Door analysis" since it examines models close to the base model. It can be applied to Gaussian regression data, generalized linear models, and other supervised learning problems with $\ell_1$ penalization. It could also be applied to best subset and stepwise regression procedures. We have implemented it in the R language as a library to accompany the well-known glmnet library.
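A compressed sketch of the procedure the abstract outlines, using sklearn's LassoCV in place of glmnet (variable names hypothetical; the p-value and model-score machinery is omitted):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)

# Toy data with three true predictors.
n, p = 200, 20
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.normal(size=n)

base = LassoCV(cv=5).fit(X, y)
chosen = np.flatnonzero(base.coef_)
base_cv = base.mse_path_.mean(axis=1).min()        # base model's CV error

# Next-Door step: drop each chosen predictor, refit, compare CV error.
for j in chosen:
    keep = np.delete(np.arange(p), j)
    nearby = LassoCV(cv=5).fit(X[:, keep], y)
    cv_j = nearby.mse_path_.mean(axis=1).min()
    print(f"drop x{j}: CV MSE {cv_j:.3f} vs base {base_cv:.3f}")
```

A large jump in CV error when a predictor is dropped is what the paper would flag as that predictor being indispensable.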

Journal ArticleDOI
TL;DR: A novel data reduction technique is introduced whereby a subset of tiles is selected to maximally ‘cover’ events of interest in large-scale biological datasets (e.g., genetic mutations) while minimizing the number of tiles.
Abstract: We introduce a novel data reduction technique whereby we select a subset of tiles to maximally "cover" events of interest in large-scale biological datasets (e.g., genetic mutations), while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to significantly reduce the number of tiles considered to those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, allowing for sensible statistical analysis after its application. We envisage application of our methods to a variety of high-throughput data types, particularly those produced by next-generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which is presented, along with methods of its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations (> 1M).
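The paper casts tile selection as a convex program; the greedy heuristic below (a simplification on a toy incidence matrix, not the paper's optimization; all sizes hypothetical) conveys the flavor of covering patients' mutation events with few tiles:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy incidence matrix: covers[i, t] = True if tile t captures a mutation
# event for patient i.
n_patients, n_tiles = 300, 1000
covers = rng.binomial(1, 0.01, size=(n_patients, n_tiles)).astype(bool)

uncovered = np.ones(n_patients, dtype=bool)
selected = []
while uncovered.any():
    gains = covers[uncovered].sum(axis=0)         # new patients each tile adds
    if gains.max() == 0:
        break                                     # remaining patients uncoverable
    t = int(gains.argmax())
    selected.append(t)
    uncovered &= ~covers[:, t]

print(f"{len(selected)} tiles cover {(~uncovered).sum()}/{n_patients} patients")
```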

Posted Content
TL;DR: This work proposes a model that contains latent factors specific to each assay as well as a common latent factor across assays, frames the model-fitting procedure as an optimization problem, and presents an iterative algorithm to solve it.
Abstract: In many domains such as healthcare or finance, data often come in different assays or measurement modalities, with features in each assay having a common theme. Simply concatenating these assays together and performing prediction can be effective but ignores this structure. In this setting, we propose a model which contains latent factors specific to each assay, as well as a common latent factor across assays. We frame our model-fitting procedure, which we call the "Sparse Factor Method" (SFM), as an optimization problem and present an iterative algorithm to solve it.
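One way to write down the structure described, with notation assumed here rather than taken from the abstract: for assays $k = 1, \dots, K$ with feature blocks $X^{(k)}$,

```latex
% Shared-plus-specific factor structure across K assays (sketch):
\[
  X^{(k)} \;\approx\; u\, {v^{(k)}}^{\top} \;+\; u^{(k)}\, {w^{(k)}}^{\top},
  \qquad k = 1, \dots, K,
\]
% where u holds latent scores common to all assays, u^{(k)} scores specific
% to assay k, and sparsity penalties on the loadings v^{(k)}, w^{(k)} give
% the factors their "sparse" character; fitting alternates over the blocks.
```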