Showing papers by "Robert Tibshirani published in 2017"
••
TL;DR: These findings unravel the precise timing of immunological events occurring during a term pregnancy and provide the analytical framework to identify immunological deviations implicated in pregnancy-related pathologies.
Abstract: The maintenance of pregnancy relies on finely tuned immune adaptations. We demonstrate that these adaptations are precisely timed, reflecting an immune clock of pregnancy in women delivering at term. Using mass cytometry, the abundance and functional responses of all major immune cell subsets were quantified in serial blood samples collected throughout pregnancy. Cell signaling-based Elastic Net, a regularized regression method adapted from the elastic net algorithm, was developed to infer and prospectively validate a predictive model of interrelated immune events that accurately captures the chronology of pregnancy. Model components highlighted existing knowledge and revealed previously unreported biology, including a critical role for the interleukin-2-dependent STAT5ab signaling pathway in modulating T cell function during pregnancy. These findings unravel the precise timing of immunological events occurring during a term pregnancy and provide the analytical framework to identify immunological deviations implicated in pregnancy-related pathologies.
330 citations
•
TL;DR: An expanded set of simulations comparing best subset selection (formulated as a mixed integer optimization, MIO, problem), forward stepwise selection, and the lasso showed that the relaxed lasso is the overall winner, performing just about as well as the lasso in low SNR scenarios and as well as best subset selection in high SNR scenarios.
Abstract: In exciting new work, Bertsimas et al. (2016) showed that the classical best subset selection problem in regression modeling can be formulated as a mixed integer optimization (MIO) problem. Using recent advances in MIO algorithms, they demonstrated that best subset selection can now be solved at much larger problem sizes than what was thought possible in the statistics community. They presented empirical comparisons of best subset selection with other popular variable selection procedures, in particular, the lasso and forward stepwise selection. Surprisingly (to us), their simulations suggested that best subset selection consistently outperformed both methods in terms of prediction accuracy. Here we present an expanded set of simulations to shed more light on these comparisons.
The summary is roughly as follows: (a) neither best subset selection nor the lasso uniformly dominate the other, with best subset selection generally performing better in high signal-to-noise (SNR) ratio regimes, and the lasso better in low SNR regimes; (b) best subset selection and forward stepwise perform quite similarly throughout; (c) the relaxed lasso (actually, a simplified version of the original relaxed estimator defined in Meinshausen, 2007) is the overall winner, performing just about as well as the lasso in low SNR scenarios, and as well as best subset selection in high SNR scenarios.
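The simplified relaxed lasso described in (c) is easy to state: fit the lasso, refit unpenalized least squares on the selected support, and blend the two fits with a parameter gamma in [0, 1]. Below is a minimal NumPy sketch with plain coordinate descent and hypothetical function names; it is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    # Plain coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]      # partial residual without j
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return b

def relaxed_lasso(X, y, lam, gamma):
    # Blend the lasso fit with an unpenalized refit on its support.
    b_lasso = lasso_cd(X, y, lam)
    support = np.flatnonzero(b_lasso)
    b_relaxed = np.zeros_like(b_lasso)
    if support.size:
        b_ols = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
        # gamma = 1 recovers the lasso; gamma = 0 the OLS refit on the support
        b_relaxed[support] = gamma * b_lasso[support] + (1 - gamma) * b_ols
    return b_relaxed
```

Cross-validating over both (lam, gamma) is what lets this estimator track the lasso in low SNR regimes and best subset selection in high SNR regimes: the refit undoes the lasso's shrinkage on the selected variables when the signal is strong.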
171 citations
••
TL;DR: Measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeded 1.0 and normal prostate when the ratio was less than 0.5, indicating that the ratio of glucose to citrate ion signals can be used to accurately identify prostate cancer.
Abstract: Accurate identification of prostate cancer in frozen sections at the time of surgery can be challenging, limiting the surgeon's ability to best determine resection margins during prostatectomy. We performed desorption electrospray ionization mass spectrometry imaging (DESI-MSI) on 54 banked human cancerous and normal prostate tissue specimens to investigate the spatial distribution of a wide variety of small metabolites, carbohydrates, and lipids. In contrast to several previous studies, our method included Krebs cycle intermediates (m/z <200), which we found to be highly informative in distinguishing cancer from benign tissue. Malignant prostate cells showed marked metabolic derangements compared with their benign counterparts. Using the "Least absolute shrinkage and selection operator" (Lasso), we analyzed all metabolites from the DESI-MS data and identified parsimonious sets of metabolic profiles for distinguishing between cancer and normal tissue. In an independent set of samples, we could use these models to classify prostate cancer from benign specimens with nearly 90% accuracy per patient. Based on previous work in prostate cancer showing that glucose levels are high while citrate is low, we found that measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeds 1.0 and normal prostate when the ratio is less than 0.5. After brief tissue preparation, the glucose/citrate ratio can be recorded on a tissue sample in 1 min or less, which is in sharp contrast to the 20 min or more required by histopathological examination of frozen tissue specimens.
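The glucose/citrate decision rule is simple enough to state directly in code. A sketch using the two thresholds reported above; the function name is hypothetical, and treating intermediate ratios as "indeterminate" is an assumption the abstract does not address.

```python
def classify_by_ratio(glucose_signal: float, citrate_signal: float) -> str:
    """Call a tissue sample from the glucose/citrate ion signal ratio,
    using the thresholds reported in the study (>1.0 cancer, <0.5 normal)."""
    ratio = glucose_signal / citrate_signal
    if ratio > 1.0:
        return "cancer"
    if ratio < 0.5:
        return "normal"
    return "indeterminate"  # between thresholds: no call made (assumption)
```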
154 citations
••
TL;DR: Desorption electrospray ionization (DESI) mass spectrometry (MS) was used to image and chemically characterize the metabolic profiles of HGSC, BOT, and normal ovarian tissue samples and suggest DESI-MS as a powerful approach for rapid serous ovarian cancer diagnosis based on altered metabolic signatures.
Abstract: Ovarian high-grade serous carcinoma (HGSC) results in the highest mortality among gynecological cancers, developing rapidly and aggressively. Dissimilarly, serous borderline ovarian tumors (BOT) can progress into low-grade serous carcinomas and have relatively indolent clinical behavior. The underlying biological differences between HGSC and BOT call for accurate diagnostic methodologies and tailored treatment options, and identification of molecular markers of aggressiveness could provide valuable biochemical insights and improve disease management. Here, we used desorption electrospray ionization (DESI) mass spectrometry (MS) to image and chemically characterize the metabolic profiles of HGSC, BOT, and normal ovarian tissue samples. DESI-MS imaging enabled clear visualization of fine papillary branches in serous BOT and allowed for characterization of spatial features of tumor heterogeneity such as adjacent necrosis and stroma in HGSC. Predictive markers of cancer aggressiveness were identified, including various free fatty acids, metabolites, and complex lipids such as ceramides, glycerophosphoglycerols, cardiolipins, and glycerophosphocholines. Classification models built from a total of 89,826 individual pixels, acquired in positive and negative ion modes from 78 different tissue samples, enabled diagnosis and prediction of HGSC and all tumor samples in comparison with normal tissues, with overall agreements of 96.4% and 96.2%, respectively. HGSC and BOT discrimination was achieved with an overall accuracy of 93.0%. Interestingly, our classification model allowed identification of three BOT samples presenting unusual histologic features that could be associated with the development of low-grade carcinomas. Our results suggest DESI-MS as a powerful approach for rapid serous ovarian cancer diagnosis based on altered metabolic signatures. Cancer Res; 77(11); 2903-13. ©2017 AACR.
91 citations
••
TL;DR: Quantitative analysis indicated that allelic choice at the majority of RAMA elements is consistent with a stochastic process; however, up to 30% of RAMA elements may deviate from the expected pattern, suggesting a regulated or counting mechanism.
Abstract: Howard Chang and colleagues use allele-specific ATAC-seq to profile active regulatory DNA across the genome in mouse embryonic stem cells and neural progenitor cells. They find that monoallelic DNA accessibility across autosomes is pervasive, developmentally programmed and composed of several patterns.
78 citations
••
TL;DR: Results indicate that precisely timed changes in the plasma proteome during term pregnancy mirror a proteomic clock, and the exciting promise of such a clock is that deviations from its regular chronological profile may assist in the early diagnosis of pregnancy-related pathologies and point to the underlying pathophysiology.
75 citations
••
TL;DR: In this article, the authors proposed distribution-based methods with exact type 1 error controls for hypothesis testing and construction of confidence intervals for signals in a noisy matrix with finite samples, assuming Gaussian noise, by utilizing a post-selection inference framework, and extending the approach of Taylor, Loftus and Tibshirani (2013) in a PCA setting.
Abstract: Principal component analysis (PCA) is a well-known tool in multivariate statistics. One significant challenge in using PCA is the choice of the number of principal components. In order to address this challenge, we propose distribution-based methods with exact type 1 error controls for hypothesis testing and construction of confidence intervals for signals in a noisy matrix with finite samples. Assuming Gaussian noise, we derive exact type 1 error controls based on the conditional distribution of the singular values of a Gaussian matrix by utilizing a post-selection inference framework, and extending the approach of [Taylor, Loftus and Tibshirani (2013)] in a PCA setting. In simulation studies, we find that our proposed methods compare well to existing approaches.
72 citations
••
TL;DR: Newly identified MIMICS-generated compounds were found to be bioactive as inhibitors of specific components of the unfolded protein response and the VEGFR2 pathway in cell-based assays, thus confirming the applicability of this methodology toward drug design applications.
Abstract: We describe a new library generation method, Machine-based Identification of Molecules Inside Characterized Space (MIMICS), that generates sets of molecules inspired by a text-based input. MIMICS-generated libraries were found to preserve distributions of properties while simultaneously increasing structural diversity. Newly identified MIMICS-generated compounds were found to be bioactive as inhibitors of specific components of the unfolded protein response (UPR) and the VEGFR2 pathway in cell-based assays, thus confirming the applicability of this methodology toward drug design applications. Wider application of MIMICS could facilitate the efficient utilization of chemical space.
64 citations
••
TL;DR: It is demonstrated that POAMLs harbor a persistent risk of relapse, including in the central nervous system, and of transformation to aggressive lymphoma (4%), requiring long-term follow-up.
52 citations
••
TL;DR: A statistical model is demonstrated using hospital patient data to quantitatively forecast, days in advance, the need for platelet transfusions, and this approach can be leveraged to significantly decrease platelet wastage, and, if adopted nationwide, would save approximately 80 million dollars per year.
Abstract: Maintaining a robust blood product supply is an essential requirement to guarantee optimal patient care in modern health care systems. However, daily blood product use is difficult to anticipate. Platelet products are the most variable in daily usage, have short shelf lives, and are also the most expensive to produce, test, and store. Due to the combination of absolute need, uncertain daily demand, and short shelf life, platelet products are frequently wasted due to expiration. Our aim is to build and validate a statistical model to forecast future platelet demand and thereby reduce wastage. We have investigated platelet usage patterns at our institution, and specifically interrogated the relationship between platelet usage and aggregated hospital-wide patient data over a recent consecutive 29-mo period. Using a convex statistical formulation, we have found that platelet usage is highly dependent on weekday/weekend pattern, number of patients with various abnormal complete blood count measurements, and location-specific hospital census data. We incorporated these relationships in a mathematical model to guide collection and ordering strategy. This model minimizes waste due to expiration while avoiding shortages; the number of remaining platelet units at the end of any day stays above 10 in our model during the same period. Compared with historical expiration rates during the same period, our model reduces the expiration rate from 10.5 to 3.2%. Extrapolating our results to the ∼2 million units of platelets transfused annually within the United States, if implemented successfully, our model can potentially save ∼80 million dollars in health care costs.
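The abstract's key predictors (the weekday/weekend pattern plus hospital census data) lend themselves to a simple linear forecasting sketch. The ordinary least-squares model below is a stand-in for the paper's convex formulation, with hypothetical function names and features.

```python
import numpy as np

def fit_demand_model(day_of_week, census, usage):
    # Least-squares fit of daily platelet usage on weekday indicators
    # plus a hospital-census covariate (a simplified stand-in for the
    # paper's convex formulation, which also uses CBC-derived counts).
    D = np.eye(7)[day_of_week]               # one-hot weekday encoding
    Z = np.column_stack([D, census])
    coef = np.linalg.lstsq(Z, usage, rcond=None)[0]
    return coef

def predict_demand(coef, day_of_week, census):
    # Forecast usage for future days from the fitted coefficients.
    D = np.eye(7)[day_of_week]
    Z = np.column_stack([D, census])
    return Z @ coef
```

A real ordering policy would not minimize squared error: as the abstract describes, the objective is to minimize waste from expiration while keeping the end-of-day inventory above a safety floor, which makes the loss asymmetric.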
45 citations
••
TL;DR: In this paper, the authors adapt recent developments by Lee et al. in post-selection inference for the Lasso to the orthogonal setting, where sample elements have different underlying signal sizes.
Abstract: We tackle the problem of the estimation of a vector of means from a single vector-valued observation $y$. Whereas previous work reduces the size of the estimates for the largest (absolute) sample elements via shrinkage (like James-Stein) or biases estimated via empirical Bayes methodology, we take a novel approach. We adapt recent developments by Lee et al. (2013) in post-selection inference for the Lasso to the orthogonal setting, where sample elements have different underlying signal sizes. This is exactly the setup encountered when estimating many means. It is shown that other selection procedures, like selecting the $K$ largest (absolute) sample elements and the Benjamini-Hochberg procedure, can be cast into their framework, allowing us to leverage their results. Point and interval estimates for signal sizes are proposed. These seem to perform quite well against competitors, both recent and more tenured.
Furthermore, we prove an upper bound to the worst case risk of our estimator, when combined with the Benjamini-Hochberg procedure, and show that it is within a constant multiple of the minimax risk over a rich set of parameter spaces meant to evoke sparsity.
•
TL;DR: This work proposes synth-validation, a procedure that estimates the estimation error of causal inference methods applied to a given dataset: generative distributions with known treatment effects are estimated from the observed data, each causal inference method is applied to datasets sampled from these distributions, and the resulting effect estimates are compared with the known effects to estimate error.
Abstract: Many decisions in healthcare, business, and other policy domains are made without the support of rigorous evidence due to the cost and complexity of performing randomized experiments. Using observational data to answer causal questions is risky: subjects who receive different treatments also differ in other ways that affect outcomes. Many causal inference methods have been developed to mitigate these biases. However, there is no way to know which method might produce the best estimate of a treatment effect in a given study. In analogy to cross-validation, which estimates the prediction error of predictive models applied to a given dataset, we propose synth-validation, a procedure that estimates the estimation error of causal inference methods applied to a given dataset. In synth-validation, we use the observed data to estimate generative distributions with known treatment effects. We apply each causal inference method to datasets sampled from these distributions and compare the effect estimates with the known effects to estimate error. Using simulations, we show that using synth-validation to select a causal inference method for each study lowers the expected estimation error relative to consistently using any single method.
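The synth-validation loop (fit a generative model with a known effect, simulate, score each estimator against that known effect) can be sketched with two toy estimators. Everything below, from the linear generative model to the function names, is an illustrative assumption rather than the paper's actual procedure.

```python
import numpy as np

def diff_in_means(t, y):
    # Naive estimator: ignores confounding entirely.
    return y[t == 1].mean() - y[t == 0].mean()

def regression_adjusted(x, t, y):
    # Effect estimate from OLS of y on [1, t, x]: adjusts for x.
    Z = np.column_stack([np.ones_like(y), t, x])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    return coef[1]

def synth_validate(x, t, y, methods, taus=(0.0, 1.0), n_rep=50, seed=0):
    # Fit a simple generative model to the observed data, then for each
    # candidate effect tau simulate confounded datasets with that known
    # effect and score each method by squared error against tau.
    rng = np.random.default_rng(seed)
    slope = np.polyfit(x, y, 1)[0]
    sigma = y.std()
    errors = {name: [] for name in methods}
    for tau in taus:
        for _ in range(n_rep):
            xs = rng.choice(x, size=x.size, replace=True)
            # confounded assignment: larger x -> more likely treated
            ts = (rng.random(x.size) < 1 / (1 + np.exp(-xs))).astype(float)
            ys = slope * xs + tau * ts + sigma * rng.normal(size=x.size)
            for name, fn in methods.items():
                errors[name].append((fn(xs, ts, ys) - tau) ** 2)
    return {name: float(np.mean(e)) for name, e in errors.items()}
```

On confounded data this procedure correctly ranks the regression-adjusted estimator above the difference in means, which is the point: the method with the lowest synthetic error is the one selected for the real analysis.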
•
[...]
TL;DR: This article proposed a generalization of the lasso that allows the model coefficients to vary as a function of a general set of modifying variables, such as gender, age or time, and presented a computationally efficient algorithm for its optimization.
Abstract: We propose a generalization of the lasso that allows the model coefficients to vary as a function of a general set of modifying variables. These modifiers might be variables such as gender, age or time. The paradigm is quite general, with each lasso coefficient modified by a sparse linear function of the modifying variables $Z$. The model is estimated in a hierarchical fashion to control the degrees of freedom and avoid overfitting. The modifying variables may be observed, observed only in the training set, or unobserved overall. There are connections of our proposal to varying coefficient models and high-dimensional interaction models. We present a computationally efficient algorithm for its optimization, with exact screening rules to facilitate application to large numbers of predictors. The method is illustrated on a number of different simulated and real examples.
••
TL;DR: The new approach provides a feasible, simple, and efficient method for analyzing matched designs with double controls; it agrees closely with conditional logistic regression and is sufficiently simple to be computed on a handheld calculator.
•
TL;DR: In this paper, the authors proposed and analyzed three methods for estimating heterogeneous treatment effects using observational data and applied them to data from a large randomized trial of a treatment for high blood pressure.
Abstract: When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records (EMRs) that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high-dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze three methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the two most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.
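One generic way to estimate heterogeneous treatment effects from data like this is a "T-learner": fit separate outcome models in the treated and control arms and difference their predictions. This is a common baseline, not necessarily one of the paper's three methods; a linear-model sketch with hypothetical function names:

```python
import numpy as np

def t_learner_fit(X, t, y):
    # Fit separate linear outcome models for the treated and control arms.
    def ols(Xa, ya):
        Z = np.column_stack([np.ones(len(ya)), Xa])
        return np.linalg.lstsq(Z, ya, rcond=None)[0]
    return ols(X[t == 1], y[t == 1]), ols(X[t == 0], y[t == 0])

def t_learner_cate(models, Xnew):
    # Conditional average treatment effect: difference of arm predictions.
    b1, b0 = models
    Z = np.column_stack([np.ones(len(Xnew)), Xnew])
    return Z @ b1 - Z @ b0
```

With observational EMR data the arms differ systematically, so in practice the outcome models would be paired with propensity adjustment or more flexible learners; the sketch only shows the estimator's structure.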
•
TL;DR: This work proposes a sparse canonical correlation analysis by adding l1 constraints on the canonical vectors and shows how to solve it efficiently using linearized alternating direction method of multipliers (ADMM) and using TFOCS as a black box.
Abstract: Canonical correlation analysis was proposed by Hotelling [6] and measures the linear relationship between two multidimensional variables. In the high-dimensional setting, the classical canonical correlation analysis breaks down. We propose a sparse canonical correlation analysis by adding l1 constraints on the canonical vectors and show how to solve it efficiently using the linearized alternating direction method of multipliers (ADMM) and using TFOCS as a black box. We illustrate this idea on simulated data.
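The flavor of l1-constrained CCA can be conveyed with alternating soft-thresholded power iterations on the cross-covariance matrix, in the spirit of penalized matrix decomposition; this is a sketch of the idea, not the linearized ADMM/TFOCS solvers the paper uses.

```python
import numpy as np

def soft(a, lam):
    # Elementwise soft-thresholding: the proximal operator of lam * ||.||_1.
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, lam_u=0.2, lam_v=0.2, n_iter=100):
    # Alternating soft-thresholded power iterations for one pair of
    # sparse canonical vectors (penalized-matrix-decomposition style).
    C = X.T @ Y / X.shape[0]                  # cross-covariance matrix
    v = np.linalg.svd(C)[2][0]                # initialize at top right SV
    u = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u = soft(C @ v, lam_u)
        if np.linalg.norm(u):
            u /= np.linalg.norm(u)
        v = soft(C.T @ u, lam_v)
        if np.linalg.norm(v):
            v /= np.linalg.norm(v)
    return u, v
```

Larger penalties lam_u and lam_v zero out more coordinates of the canonical vectors, trading canonical correlation for interpretability.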
••
TL;DR: An embedding of the log‐ratio parameter space into a space of much lower dimension is introduced and used as the foundation for a two‐step fitting procedure that combines a convex filtering step with a second non‐convex pruning step to yield highly sparse solutions.
Abstract: Positive-valued signal data is common in many biological and medical applications, where the data are often generated from imaging techniques such as mass spectrometry. In such a setting, the relative intensities of the raw features are often the scientifically meaningful quantities, so it is of interest to identify relevant features that take the form of log-ratios of the raw inputs. When including the log-ratios of all pairs of predictors, the dimensionality of this predictor space becomes large, so computationally efficient statistical procedures are required. We introduce an embedding of the log-ratio parameter space into a space of much lower dimension and develop an efficient penalized fitting procedure using this more tractable representation. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set we find that these methods fit highly sparse models with log-ratio features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.
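The log-ratio expansion is easy to make concrete: every pairwise ratio is a difference of log-intensities, log(x_i/x_j) = log x_i - log x_j, which is why the p(p-1)/2 ratio features live in a p-dimensional space, the structure the paper's embedding exploits. A small sketch (function name hypothetical):

```python
import numpy as np
from itertools import combinations

def log_ratio_features(X, eps=1e-8):
    # Expand positive-valued features into log-ratios of all pairs.
    # Each column is a difference of two log-intensity columns, so the
    # p*(p-1)/2 features are linear in the p log-intensities.
    L = np.log(X + eps)                       # eps guards against zeros
    pairs = list(combinations(range(X.shape[1]), 2))
    F = np.column_stack([L[:, i] - L[:, j] for i, j in pairs])
    return F, pairs
```

Feeding F to a lasso gives the naive (quadratic-size) version of the log-ratio model; the paper's contribution is doing the equivalent fit efficiently through the low-dimensional embedding.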
••
TL;DR: KLHL6 immunohistochemistry may prove a useful adjunct in the diagnosis and future classification of B-cell lymphomas.
Abstract: Objectives: KLHL6 is a recently described BTB-Kelch protein with selective expression in lymphoid tissues and is most strongly expressed in germinal center B cells. Methods: Using gene expression profiling as well as immunohistochemistry with an anti-KLHL6 monoclonal antibody, we have characterized the expression of this molecule in normal and neoplastic tissues. Protein expression was evaluated in 1,058 hematopoietic neoplasms. Results: Consistent with its discovery as a germinal center marker, KLHL6 was positive mainly in B-cell neoplasms of germinal center derivation, including 95% of follicular lymphomas (106/112). B-cell lymphomas of non-germinal center derivation were generally negative (0/33 chronic lymphocytic leukemias/small lymphocytic lymphomas, 3/49 marginal zone lymphomas, and 2/66 mantle cell lymphomas). Conclusions: In addition to other germinal center markers, including BCL6, CD10, HGAL, and LMO2, KLHL6 immunohistochemistry may prove a useful adjunct in the diagnosis and future classification of B-cell lymphomas.
•
TL;DR: The authors proposed the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression, which has the advantage of leveraging underlying structure among the response categories to make better predictions.
Abstract: We propose the nuclear norm penalty as an alternative to the ridge penalty for regularized multinomial regression. This convex relaxation of reduced-rank multinomial regression has the advantage of leveraging underlying structure among the response categories to make better predictions. We apply our method, nuclear penalized multinomial regression (NPMR), to Major League Baseball play-by-play data to predict outcome probabilities based on batter-pitcher matchups. The interpretation of the results meshes well with subject-area expertise and also suggests a novel understanding of what differentiates players.
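The computational core of a nuclear norm penalty is its proximal operator, singular value soft-thresholding, which a proximal-gradient fit of a model like NPMR would call once per iteration on the coefficient matrix. A minimal sketch, not the authors' implementation:

```python
import numpy as np

def svt(B, lam):
    # Proximal operator of lam * ||B||_* (nuclear norm):
    # soft-threshold the singular values, keeping the singular vectors.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)
    return U @ np.diag(s_thr) @ Vt
```

Because small singular values are sent exactly to zero, repeated application inside a proximal-gradient loop drives the coefficient matrix toward low rank, which is how the penalty shares structure across the response categories.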
••
TL;DR: Baseline and interim ctDNA measurements have prognostic significance in aggressive lymphomas and are integrated with established risk factors to develop a model to predict an individual's disease risk.