scispace - formally typeset
Author

Robert Tibshirani

Bio: Robert Tibshirani is an academic researcher at Stanford University. He has contributed to research on topics including the lasso (statistics) and gene expression profiling. He has an h-index of 147 and has co-authored 593 publications receiving 326,580 citations. His previous affiliations include the University of Toronto and the University of California.


Papers
Journal ArticleDOI
TL;DR: This work uses a log-linear model with a new approach to normalization for sequencing count data, derives a novel procedure to estimate the false discovery rate (FDR), and demonstrates that the method has potential advantages over existing methods based on a Poisson or negative binomial model.
Abstract: We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.
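The two-class setting described above can be illustrated with a toy per-gene test. This is not the authors' exact procedure (their normalization and FDR estimation differ); it is only a minimal sketch of a Poisson likelihood-ratio statistic with sequencing-depth offsets, with all names invented here:

```python
import math

def poisson_lr_stat(counts, depths, classes):
    """Likelihood-ratio statistic for one gene: does the per-read rate
    differ between the two classes? Larger values = stronger evidence."""
    def loglik(xs, ds, rate):
        # Poisson log-likelihood up to terms that cancel in the ratio
        return sum(x * math.log(rate * d) - rate * d for x, d in zip(xs, ds))

    # Null hypothesis: one shared rate for all samples
    rate0 = sum(counts) / sum(depths)
    ll0 = loglik(counts, depths, rate0)

    # Alternative: a separate rate per class
    ll1 = 0.0
    for c in set(classes):
        xs = [x for x, k in zip(counts, classes) if k == c]
        ds = [d for d, k in zip(depths, classes) if k == c]
        ll1 += loglik(xs, ds, sum(xs) / sum(ds))

    return 2.0 * (ll1 - ll0)

# A gene with a roughly two-fold difference between classes
stat = poisson_lr_stat(
    counts=[10, 12, 11, 22, 20, 24],
    depths=[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    classes=[0, 0, 0, 1, 1, 1],
)
print(round(stat, 2))
```

In practice such a statistic would be computed for every gene, with the depths taken from each experiment's total read count, and the resulting statistics fed into an FDR-estimation step.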

325 citations

Journal ArticleDOI
TL;DR: Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers did not show a similar pattern of coexpression in the ovarian cancers.
Abstract: We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.

321 citations

Journal ArticleDOI
09 Apr 2020-Nature
TL;DR: It is shown that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic; a machine-learning method termed ‘lung cancer likelihood in plasma’ (Lung-CLiP) is developed, which can robustly discriminate early-stage lung cancer patients from risk-matched controls.
Abstract: Radiologic screening of high-risk adults reduces lung-cancer-related mortality1,2; however, a small minority of eligible individuals undergo such screening in the United States3,4. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq)5, a method for the analysis of circulating tumour DNA (ctDNA), to better facilitate screening applications. We show that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic. We also find that the majority of somatic mutations in the cell-free DNA (cfDNA) of patients with lung cancer and of risk-matched controls reflect clonal haematopoiesis and are non-recurrent. Compared with tumour-derived mutations, clonal haematopoiesis mutations occur on longer cfDNA fragments and lack mutational signatures that are associated with tobacco smoking. Integrating these findings with other molecular features, we develop and prospectively validate a machine-learning method termed 'lung cancer likelihood in plasma' (Lung-CLiP), which can robustly discriminate early-stage lung cancer patients from risk-matched controls. This approach achieves performance similar to that of tumour-informed ctDNA detection and enables tuning of assay specificity in order to facilitate distinct clinical applications. Our findings establish the potential of cfDNA for lung cancer screening and highlight the importance of risk-matching cases and controls in cfDNA-based screening studies.
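Lung-CLiP itself integrates many molecular features. Purely as an illustration of the case-versus-control classification step, here is a from-scratch logistic classifier on two invented features (a ctDNA signal and a short-fragment fraction); this is not the authors' model:

```python
import math
import random

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Plain logistic regression trained by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy features per subject: [ctDNA signal, fraction of short cfDNA fragments]
random.seed(0)
cases    = [[random.gauss(2.0, 0.5), random.gauss(0.6, 0.1)] for _ in range(20)]
controls = [[random.gauss(0.5, 0.5), random.gauss(0.4, 0.1)] for _ in range(20)]
X = cases + controls
y = [1] * 20 + [0] * 20

w, b = train_logreg(X, y)
acc = sum((predict(w, b, xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(y)
print(acc)
```

The real method additionally tunes specificity for distinct clinical applications, which in a sketch like this would correspond to choosing a decision threshold other than 0.5.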

320 citations

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of combining a collection of general regression fit vectors to obtain a better predictive model and develop a general framework for this problem and examine a cross-validation-based proposal called "model mix" or "stacking" in this context.
Abstract: We consider the problem of how to combine a collection of general regression fit vectors to obtain a better predictive model. The individual fits may be from subset linear regression, ridge regression, or something more complex like a neural network. We develop a general framework for this problem and examine a cross-validation-based proposal called “model mix” or “stacking” in this context. We also derive combination methods based on the bootstrap and analytic methods and compare them in examples. Finally, we apply these ideas to classification problems where the estimated combination weights can yield insight into the structure of the problem.
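The stacking idea can be sketched in a few lines: obtain cross-validated predictions from each base learner, then solve for least-squares combination weights. A minimal stdlib sketch with two toy base learners (all names invented; the paper also considers nonnegativity constraints and other refinements):

```python
def linfit(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def stack_weights(y, pa, pb):
    """Least-squares combination weights for two vectors of
    cross-validated predictions (2x2 normal equations)."""
    aa = sum(p * p for p in pa)
    bb = sum(q * q for q in pb)
    ab = sum(p * q for p, q in zip(pa, pb))
    ay = sum(p * t for p, t in zip(pa, y))
    by = sum(q * t for q, t in zip(pb, y))
    det = aa * bb - ab * ab
    return (ay * bb - by * ab) / det, (by * aa - ay * ab) / det

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]   # roughly y = 2x

# Leave-one-out predictions from two base learners:
# model A = training mean, model B = simple linear fit
pa, pb = [], []
for i in range(len(xs)):
    xt, yt = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
    pa.append(sum(yt) / len(yt))
    s, c = linfit(xt, yt)
    pb.append(s * xs[i] + c)

wa, wb = stack_weights(ys, pa, pb)
print(round(wa, 2), round(wb, 2))
```

Because the data here are nearly linear, the weights should load almost entirely on the linear base learner; on harder problems the weights themselves reveal which fits carry information.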

318 citations

Journal ArticleDOI
TL;DR: The proposed penalized maximum likelihood problem is not convex, so a majorize-minimize approach is used in which convex approximations to the original nonconvex problem are solved iteratively; the method can also solve a previously studied special case in which a desired sparsity pattern is prespecified.
Abstract: We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method’s close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.
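The abstract notes that simple elementwise thresholding of the empirical covariance matrix is competitive for identifying the sparsity structure. The following is a minimal sketch of that thresholding baseline only, not of the paper's majorize-minimize algorithm (and unlike the paper's estimator, thresholding does not guarantee positive definiteness):

```python
def soft_threshold_cov(S, lam):
    """Soft-threshold the off-diagonal entries of an empirical
    covariance matrix S elementwise; the diagonal is left untouched."""
    p = len(S)
    out = [row[:] for row in S]
    for i in range(p):
        for j in range(p):
            if i != j:
                s = S[i][j]
                out[i][j] = max(abs(s) - lam, 0.0) * (1 if s > 0 else -1)
    return out

# Toy empirical covariance: one strong entry, two near-zero entries
S = [[2.00,  0.80,  0.05],
     [0.80,  1.50, -0.02],
     [0.05, -0.02,  1.00]]
T = soft_threshold_cov(S, 0.1)
print(T)
```

Zeros produced this way correspond to estimated marginal independencies, mirroring the interpretation of zeros in the paper's penalized estimate.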

307 citations


Cited by
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
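A minimal usage example of the library's estimator API (the calls are real scikit-learn functions; the dataset and model choice are purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a bundled dataset, hold out 30% for testing
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Every estimator follows the same fit/predict/score convention
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

The uniform `fit`/`predict`/`score` interface is the "API consistency" the abstract emphasizes: swapping in a different estimator changes one line.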

47,974 citations

Journal ArticleDOI
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
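DESeq2's actual empirical-Bayes machinery (trended priors, maximum a posteriori estimates) is far more involved; the following toy function, with all names invented, only illustrates the core idea of shrinkage estimation, pulling noisy per-gene estimates toward a pooled value to stabilize them:

```python
def shrink(estimates, weight):
    """Shrink each per-gene estimate toward the pooled mean.
    weight = 0 returns the raw estimates; weight = 1 returns the pooled mean."""
    pooled = sum(estimates) / len(estimates)
    return [(1 - weight) * e + weight * pooled for e in estimates]

raw = [0.1, 0.5, 2.0, 0.4, 0.3]   # noisy per-gene dispersion estimates
out = shrink(raw, 0.5)
print([round(v, 2) for v in out])
```

Extreme estimates (here the 2.0) move toward the center while the overall mean is preserved, which is why shrinkage improves stability with small replicate numbers.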

47,038 citations

Journal ArticleDOI
TL;DR: A new method for estimation in linear models, called the ‘lasso’, is proposed: it minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
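The constrained form described above is equivalent to an L1-penalized least-squares problem, which can be solved by cyclic coordinate descent with soft-thresholding (a later algorithmic development, not part of the original paper). A small pure-Python sketch of the penalized form, without an intercept:

```python
def soft(z, g):
    """Soft-thresholding operator: shrink z toward zero by g."""
    return max(abs(z) - g, 0.0) * (1 if z > 0 else -1)

def lasso_cd(X, y, lam, iters=200):
    """Lasso via cyclic coordinate descent: minimize
    (1/2n)||y - Xb||^2 + lam * sum|b_j|."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    r = y[:]                       # running residual y - Xb
    for _ in range(iters):
        for j in range(p):
            xj = [row[j] for row in X]
            # correlation of column j with the partial residual
            rho = sum(xj[i] * (r[i] + xj[i] * b[j]) for i in range(n)) / n
            norm = sum(v * v for v in xj) / n
            bj_new = soft(rho, lam) / norm
            delta = bj_new - b[j]
            r = [r[i] - delta * xj[i] for i in range(n)]
            b[j] = bj_new
    return b

# y depends on feature 0 only; the lasso should zero out feature 1
X = [[1.0, 0.2], [2.0, -0.1], [3.0, 0.3], [4.0, -0.2], [5.0, 0.1]]
y = [2.0, 4.1, 5.9, 8.2, 9.9]
b = lasso_cd(X, y, lam=0.5)
print([round(v, 2) for v in b])
```

The second coefficient lands at exactly zero, which is the "some coefficients that are exactly 0" property that makes lasso models interpretable.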

40,785 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

40,257 citations

Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations