Author

Robert Tibshirani

Bio: Robert Tibshirani is an academic researcher at Stanford University. He has contributed to research on topics including Lasso (statistics) and Elastic net regularization. He has an h-index of 147 and has co-authored 593 publications receiving 326,580 citations. His previous affiliations include the University of Toronto and the University of California.


Papers
Book ChapterDOI
07 May 2015

3 citations

Posted Content
TL;DR: In this paper, a new sparse regression method called the component lasso is proposed, which uses the connected-components structure of the sample covariance matrix to split the problem into smaller ones and then solves the subproblems separately, obtaining a coefficient vector for each one.
Abstract: We propose a new sparse regression method called the component lasso, based on a simple idea. The method uses the connected-components structure of the sample covariance matrix to split the problem into smaller ones. It then solves the subproblems separately, obtaining a coefficient vector for each one. Then, it uses non-negative least squares to recombine the different vectors into a single solution. This step is useful in selecting and reweighting components that are correlated with the response. Simulated and real data examples show that the component lasso can outperform standard regression methods such as the lasso and elastic net, achieving a lower mean squared error as well as better support recovery.
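
The recipe in this abstract is concrete enough to sketch in code. Below is a minimal illustration of the idea, not the authors' implementation: features are grouped by the connected components of a thresholded sample covariance matrix, a lasso is fit within each component, and non-negative least squares reweights the component fits. The threshold `tau`, the penalty `alpha`, and the use of scikit-learn's Lasso are assumptions made for the sketch.

```python
# Minimal sketch of the component-lasso idea (not the authors' implementation).
# Assumes X and y are roughly centered so intercepts can be ignored.
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.optimize import nnls
from sklearn.linear_model import Lasso

def component_lasso(X, y, tau=0.1, alpha=0.1):
    n, p = X.shape

    # Connect features whose absolute sample covariance exceeds the threshold tau,
    # then split the features into connected components.
    S = np.cov(X, rowvar=False)
    n_comp, labels = connected_components(np.abs(S) > tau, directed=False)

    # Fit a lasso separately on the features of each component.
    fits = np.zeros((n, n_comp))   # fitted values contributed by each component
    beta = np.zeros(p)
    for k in range(n_comp):
        idx = np.where(labels == k)[0]
        coef = Lasso(alpha=alpha).fit(X[:, idx], y).coef_
        beta[idx] = coef
        fits[:, k] = X[:, idx] @ coef

    # Recombine: non-negative least squares selects and reweights the components
    # according to how well each one predicts the response.
    w, _ = nnls(fits, y)
    for k in range(n_comp):
        beta[labels == k] *= w[k]
    return beta
```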

3 citations

Proceedings ArticleDOI
TL;DR: A consistent “cancer-like” signal was observed at 99% specificity for invasive cancer, supporting the promise of cfDNA assays for early cancer detection; the first learnings from multiple cfDNA assays in CCGA are reported here.
Abstract: CCGA [NCT02889978] is the largest study of cfDNA-based early cancer detection; the first CCGA learnings from multiple cfDNA assays are reported here. This prospective, multi-center, observational study has enrolled 10,012 of 15,000 demographically-balanced participants at 141 sites. Blood was collected from participants with newly diagnosed therapy-naive cancer (C, case) and participants without a diagnosis of cancer (noncancer [NC], control) as defined at enrollment. This preplanned substudy included 878 cases, 580 controls, and 169 assay controls (n=1627) across 20 tumor types and all clinical stages. All samples were analyzed by: 1) Paired cfDNA and white blood cell (WBC)-targeted sequencing (60,000X, 507 gene panel); a joint caller removed WBC-derived somatic variants and residual technical noise; 2) Paired cfDNA and WBC whole-genome sequencing (WGS; 35X); a novel machine learning algorithm generated cancer-related signal scores; joint analysis identified shared events; and 3) cfDNA whole-genome bisulfite sequencing (WGBS; 34X); normalized scores were generated using abnormally methylated fragments. In the targeted assay, non-tumor WBC-matched cfDNA somatic variants (SNVs/indels) accounted for 76% of all variants in NC and 65% in C. Consistent with somatic mosaicism (i.e., clonal hematopoiesis), WBC-matched variants increased with age; several were non-canonical loss-of-function mutations not previously reported. After WBC variant removal, canonical driver somatic variants were highly specific to C (e.g., in EGFR and PIK3CA, 0 NC had variants vs 11 and 30, respectively, of C). Similarly, of 8 NC with somatic copy number alterations (SCNAs) detected with WGS, 4 were derived from WBCs. WGBS data revealed informative hyper- and hypo-fragment level CpGs (1:2 ratio); a subset was used to calculate methylation scores. A consistent “cancer-like” signal was observed at 99% specificity for invasive cancer, supporting the promise of cfDNA assays for early cancer detection. Additional data will be presented on detected plasma:tissue variant concordance and on multi-assay modeling. Citation Format: Alexander A. Aravanis, Geoffrey R. Oxnard, Tara Maddala, Earl Hubbell, Oliver Venn, Arash Jamshidi, Ling Shen, Hamed Amini, John A. Beausang, Craig Betts, Daniel Civello, Konstantin Davydov, Saniya Fazullina, Darya Filippova, Sante Gnerre, Samuel Gross, Chenlu Hou, Roger Jiang, Byoungsok Jung, Kathryn Kurtzman, Collin Melton, Shivani Nautiyal, Jonathan Newman, Joshua Newman, Cosmos Nicolaou, Richard Rava, Onur Sakarya, Ravi Vijaya Satya, Seyedmehdi Shojaee, Kristan Steffen, Anton Valouev, Hui Xu, Jeanne Yue, Nan Zhang, Jose Baselga, Rosanna Lapham, Daron G. Davis, David Smith, Donald Richards, Michael V. Seiden, Charles Swanton, Timothy J. Yeatman, Robert Tibshirani, Christina Curtis, Sylvia K. Plevritis, Richard Williams, Eric Klein, Anne-Renee Hartman, Minetta C. Liu. Development of plasma cell-free DNA (cfDNA) assays for early cancer detection: first insights from the Circulating Cell-Free Genome Atlas Study (CCGA) [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2018; 2018 Apr 14-18; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2018;78(13 Suppl):Abstract nr LB-343.

3 citations

Book
01 Jan 2005
TL;DR: Two new lasso-based methods are proposed that produce sparse, interpretable regression models relating clusters of co-expressed genes to a quantitative phenotype; the need for supervised clustering of genes, in which the phenotype influences how genes are clustered, is also discussed.
Abstract: In the past decade, DNA and oligonucleotide microarray technology has been developed, allowing gene expression levels to be measured on a genome-wide scale. Use of this massive amount of molecular information appears to be promising for discovering genetic networks. Classification based on microarray experiments has been studied extensively. In comparison, microarray gene expression data have been analyzed less frequently in a regression set-up. From a statistical point of view, the challenge with analyzing microarray gene expression data is due to the very large number of genes, which far exceeds the sample size, i.e., the so-called “large p, small n” scenario. The lasso (least absolute shrinkage and selection operator) method is a promising regression method that incorporates automatic variable selection by imposing an L1 penalty on the regression coefficients. However, the lasso method has limitations in the “large p, small n” scenario. When p > n, the lasso method can select up to n variables before it saturates, and it does not offer a “grouped selection” effect. Therefore, we propose two new methods, based on lasso, that are particularly suitable for microarray data regression analysis. The methods can produce sparse, interpretable regression models that relate clusters of co-expressed genes to a quantitative phenotype. Our methods are tested on simulated data sets as well as real microarray data sets. Besides the proposal of novel regression methods, we also propose quantitative definitions for evaluating the strength of the “grouped variable” effect in fitted regression models. The new definitions allow us to compare regression models quantitatively. We then discuss a need for supervised clustering of genes, that is, the phenotype ought to have an influence on how genes are clustered. One potential approach is to re-define the distances between pairs of genes by incorporating the phenotype into the definition of the new distance metric.
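
The two limitations noted above (saturation at n selected variables when p > n, and the absence of a grouped-selection effect) are easy to see on simulated data. The snippet below is an illustrative experiment, not taken from this work; the sample sizes, correlation structure, and penalty settings are arbitrary assumptions.

```python
# Illustrative "large p, small n" experiment (not from this work): the lasso can
# select at most n variables and tends to pick single representatives of
# correlated groups, whereas the elastic net spreads weight over whole groups.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 30, 200
z = rng.normal(size=(n, 5))                      # 5 latent signals
# 10 nearly identical copies of each signal ("groups") plus 150 noise features.
X = np.hstack([np.repeat(z, 10, axis=1) + 0.01 * rng.normal(size=(n, 50)),
               rng.normal(size=(n, 150))])
y = z.sum(axis=1) + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.05, max_iter=10000).fit(X, y)
enet = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10000).fit(X, y)
print("lasso nonzero coefficients:      ", np.sum(lasso.coef_ != 0))   # never more than n
print("elastic net nonzero coefficients:", np.sum(enet.coef_ != 0))    # spread across groups
```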

3 citations

Journal ArticleDOI
TL;DR: In this paper, the authors develop a new technique to estimate the integral of the distribution of T2 relaxation time without imposing any constraint other than the monotonicity of the underlying cumulative relaxation time distribution.
Abstract: Magnetic resonance imaging techniques can be used to measure some biophysical properties of tissue. In this context, the T2 relaxation time is an important parameter for soft-tissue contrast. The authors develop a new technique to estimate the integral of the distribution of T2 relaxation time without imposing any constraint other than the monotonicity of the underlying cumulative relaxation time distribution. They explore the properties of the estimator and its applications for the analysis of breast tissue data. As they show, an extension of linear discriminant analysis distinguishes well between two classes of breast tissue.
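
One common way to make this concrete (a rough sketch, not necessarily the authors' estimator) is to model the decay signal as a non-negative mixture of exponentials, s(t) = Σ_j w_j exp(−t/T2_j); non-negative weights make the implied cumulative T2 distribution monotone by construction. The echo times, T2 grid, and noise level below are assumptions.

```python
# Rough sketch of monotone estimation of the cumulative T2 distribution from a
# multi-exponential decay curve (not necessarily the authors' estimator).
import numpy as np
from scipy.optimize import nnls

t = np.linspace(0.01, 0.3, 32)           # echo times in seconds (assumed)
T2_grid = np.logspace(-2.5, -0.5, 60)    # candidate T2 values in seconds (assumed)

# Simulated decay from two tissue compartments plus measurement noise.
rng = np.random.default_rng(1)
signal = 0.7 * np.exp(-t / 0.04) + 0.3 * np.exp(-t / 0.15)
signal = signal + 0.01 * rng.normal(size=t.size)

# Design matrix of exponential decays; NNLS keeps the mixture weights non-negative.
A = np.exp(-np.outer(t, 1.0 / T2_grid))
w, _ = nnls(A, signal)

# Cumulative distribution of T2: a non-decreasing step function by construction.
F = np.cumsum(w) / w.sum()
```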

2 citations


Cited by
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
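
A minimal usage example of the uniform fit/predict interface the abstract refers to; the dataset and model settings are illustrative assumptions.

```python
# Every scikit-learn estimator follows the same fit/predict/score pattern, so
# swapping models requires almost no code changes. Illustrative settings only.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Lasso(alpha=1.0).fit(X_train, y_train)   # any estimator works the same way
print("held-out R^2:", model.score(X_test, y_test))
```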

47,974 citations

Journal ArticleDOI
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .
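
The shrinkage idea can be illustrated with a toy calculation. The snippet below is not DESeq2's estimator (DESeq2 works on negative-binomial counts and fits a dispersion trend); it is only a normal-normal sketch of how imprecisely measured fold changes are pulled toward zero, which is what lets shrunken estimates rank genes by the strength rather than the mere presence of differential expression. All numbers are assumptions.

```python
# Toy illustration of fold-change shrinkage (not DESeq2's actual method):
# noisy per-gene log2 fold changes are pulled toward a zero-centered prior,
# and the genes measured least precisely are shrunk the most.
import numpy as np

rng = np.random.default_rng(0)
true_lfc = np.concatenate([np.zeros(900), rng.normal(0.0, 2.0, 100)])  # most genes unchanged
se = rng.uniform(0.2, 2.0, size=1000)                                  # per-gene standard errors
observed_lfc = true_lfc + se * rng.normal(size=1000)

prior_var = 1.0                                        # assumed prior variance
shrunk_lfc = observed_lfc * prior_var / (prior_var + se**2)

# The most precisely measured gene barely moves; the noisiest gene moves a lot.
shift = np.abs(observed_lfc - shrunk_lfc)
order = np.argsort(se)
print("shift for lowest-noise gene:", shift[order[0]])
print("shift for highest-noise gene:", shift[order[-1]])
```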

47,038 citations

Journal ArticleDOI
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Abstract: SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree-based models are briefly described.
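
For reference, the criterion described here can be written out explicitly; the constrained and penalized forms below are equivalent, with a one-to-one correspondence between the bound t and the penalty λ.

```latex
% Lasso, constrained form: least squares subject to an L1 bound on the coefficients.
\hat{\beta} = \arg\min_{\beta}\, \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
\quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert \le t .

% Equivalent penalized (Lagrangian) form, with penalty parameter \lambda \ge 0:
\hat{\beta} = \arg\min_{\beta}\, \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
+ \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert .
```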

40,785 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception is a deep convolutional neural network architecture that achieves a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
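
The building block described here (parallel 1×1, 3×3, and 5×5 convolutions plus a pooled branch, with 1×1 convolutions used as cheap dimension reductions, concatenated along the channel axis) can be sketched compactly. The PyTorch code below is a simplified illustration without batch normalization or the auxiliary classifiers, and the channel counts are illustrative.

```python
# Simplified sketch of one Inception module (illustrative channel counts).
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),        # 1x1 reduction
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),        # 1x1 reduction
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # All branches preserve spatial size; outputs are concatenated on channels.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
out = block(torch.randn(1, 192, 28, 28))    # shape: [1, 256, 28, 28]
```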

40,257 citations

Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and to understand the world in terms of a hierarchy of concepts; it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations