Author

Wenfei Du

Bio: Wenfei Du is an academic researcher from Stanford University. The author has contributed to research on topics including Lasso (statistics) and elastic net regularization. The author has an h-index of 5 and has co-authored 10 publications receiving 199 citations.

Papers
Journal ArticleDOI
TL;DR: This work studies the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and shows that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect.
Abstract: We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.
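A minimal sketch of this kind of estimator on simulated data, assuming a completely randomized design with known treatment probability: cross-fit lasso regression adjustments with glmnet, then combine them with the residuals in an AIPW-style average. This illustrates the spirit of cross-estimation rather than the paper's exact formula; all names and data below are illustrative.

```r
# Cross-fitted, lasso-adjusted ATE estimate for a randomized experiment.
# Sketch only: simulated data, known assignment probability pi_treat.
library(glmnet)

set.seed(1)
n <- 400; p <- 50; pi_treat <- 0.5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, pi_treat)
Y <- X[, 1] + 0.5 * X[, 2] + W + rnorm(n)        # true ATE = 1

K <- 5
fold <- sample(rep(1:K, length.out = n))
mu1 <- mu0 <- numeric(n)
for (k in 1:K) {
  tr <- fold != k; te <- fold == k
  # fit the regression adjustments on the other folds only
  fit1 <- cv.glmnet(X[tr & W == 1, ], Y[tr & W == 1])
  fit0 <- cv.glmnet(X[tr & W == 0, ], Y[tr & W == 0])
  mu1[te] <- predict(fit1, X[te, ], s = "lambda.min")
  mu0[te] <- predict(fit0, X[te, ], s = "lambda.min")
}
# regression-adjusted average plus residual correction
tau_hat <- mean(mu1 - mu0 +
                W * (Y - mu1) / pi_treat -
                (1 - W) * (Y - mu0) / (1 - pi_treat))
tau_hat
```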

97 citations

Journal ArticleDOI
TL;DR: A computational framework called batch screening iterative lasso (BASIL) is proposed that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size.

Abstract: The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports l1-penalized linear models, logistic regression, and the Cox model, and also extends to the elastic net with an l1/l2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
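A minimal sketch of the screening idea behind BASIL on simulated data: fit the lasso path on a small working set of variables, then certify each lambda by checking the KKT condition |x_j'r|/n <= lambda over all excluded variables, adding violators to the working set for the next round. The batch size and screening rule below are illustrative simplifications, not snpnet's actual implementation.

```r
# BASIL-style screen-fit-check loop, one iteration shown.
library(glmnet)

set.seed(1)
n <- 500; p <- 5000
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:5] %*% rep(1, 5) + rnorm(n))

score <- abs(crossprod(X, y - mean(y))) / n        # gradients at the null model
lambdas <- max(score) * 0.8^(1:20)                 # a short decreasing path

working <- order(score, decreasing = TRUE)[1:200]  # first screened batch
fit <- glmnet(X[, working], y, lambda = lambdas, standardize = FALSE)

# certify one lambda: every excluded variable must satisfy |x_j' r| / n <= lambda
r <- y - drop(predict(fit, X[, working], s = lambdas[10]))
viol <- setdiff(which(abs(crossprod(X, r)) / n > lambdas[10]), working)
length(viol)   # violators would be added to the working set next round
```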

81 citations

Journal ArticleDOI
TL;DR: Found In Translation (FIT) is presented, a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition, and is shown to predict novel disease-associated genes.
Abstract: Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has yet to be systematically incorporated in the interpretation of animal models. Here we present Found In Translation (FIT; http://www.mouse2man.org ), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20–50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost. The machine learning approach FIT leverages public mouse and human expression data to improve the translation of mouse model results to analogous human disease.
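A heavily simplified sketch of the per-gene extrapolation idea on simulated fold changes, assuming paired public mouse and human results are available as matrices: learn, for each gene, a model mapping mouse effects to the human effect, then apply it to a new mouse experiment. The per-gene lasso below is a stand-in for FIT's actual model, which is described in the paper and at http://www.mouse2man.org.

```r
# Per-gene mouse-to-human extrapolation, sketched with simulated data.
library(glmnet)

set.seed(1)
n_datasets <- 60; n_genes <- 100
mouse_fc <- matrix(rnorm(n_datasets * n_genes), n_datasets, n_genes)
human_fc <- mouse_fc %*% diag(runif(n_genes, 0.5, 1.5)) +
  matrix(rnorm(n_datasets * n_genes, sd = 0.5), n_datasets, n_genes)

new_mouse <- rnorm(n_genes)            # fold changes from a new mouse experiment
pred_human <- numeric(n_genes)
for (g in 1:n_genes) {
  # model for gene g: its human effect as a function of the full mouse profile
  fit <- cv.glmnet(mouse_fc, human_fc[, g])
  pred_human[g] <- predict(fit, t(new_mouse), s = "lambda.min")
}
```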

52 citations

Posted ContentDOI
07 May 2019-bioRxiv
TL;DR: A meta-algorithm, batch screening iterative lasso (BASIL), is proposed that can take advantage of any existing lasso solver to build a scalable lasso solution for large datasets, and achieves state-of-the-art heritability estimation on quantitative and qualitative traits.

Abstract: Since its first proposal in statistics (Tibshirani, 1996), the lasso has been an effective method for simultaneous variable selection and estimation. A number of packages have been developed to solve the lasso efficiently. However, as large datasets become more prevalent, many algorithms are constrained by efficiency or memory bounds. In this paper, we propose a meta-algorithm, batch screening iterative lasso (BASIL), that can take advantage of any existing lasso solver and build a scalable lasso solution for large datasets. We also introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) for large-scale single nucleotide polymorphism (SNP) datasets that are widely studied in genetics. We demonstrate results on a large genotype-phenotype dataset from the UK Biobank, where we achieve state-of-the-art heritability estimation on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

24 citations

Posted ContentDOI
31 May 2020-bioRxiv
TL;DR: A novel computational framework called batch screening iterative lasso (BASIL) is proposed that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size.

Abstract: The UK Biobank (Bycroft et al., 2018) is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with GWAS, have already been shown to greatly improve prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso (Tibshirani, 1996), since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including data larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet (Friedman et al., 2010a) and is optimized for single nucleotide polymorphism (SNP) datasets. It currently supports l1-penalized linear models, logistic regression, and the Cox model, and also extends to the elastic net with an l1/l2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve superior predictive performance on quantitative and qualitative traits including height, body mass index, asthma and high cholesterol.

16 citations


Cited by
01 Jan 2010
TL;DR: In this paper, the authors show that hundreds of genetic variants, in at least 180 loci, influence adult height, a highly heritable and classic polygenic trait, revealing patterns with important implications for genetic studies of common human diseases and traits.
Abstract: Most common human traits and diseases have a polygenic pattern of inheritance: DNA sequence variants at many genetic loci influence the phenotype. Genome-wide association (GWA) studies have identified more than 600 variants associated with human traits, but these typically explain small fractions of phenotypic variation, raising questions about the use of further studies. Here, using 183,727 individuals, we show that hundreds of genetic variants, in at least 180 loci, influence adult height, a highly heritable and classic polygenic trait. The large number of loci reveals patterns with important implications for genetic studies of common human diseases and traits. First, the 180 loci are not random, but instead are enriched for genes that are connected in biological pathways (P = 0.016) and that underlie skeletal growth defects (P < 0.001). Second, the likely causal gene is often located near the most strongly associated variant: in 13 of 21 loci containing a known skeletal growth gene, that gene was closest to the associated variant. Third, at least 19 loci have multiple independently associated variants, suggesting that allelic heterogeneity is a frequent feature of polygenic traits, that comprehensive explorations of already-discovered loci should discover additional variants and that an appreciable fraction of associated loci may have been identified. Fourth, associated variants are enriched for likely functional effects on genes, being over-represented among variants that alter amino-acid structure of proteins and expression levels of nearby genes. Our data explain approximately 10% of the phenotypic variation in height, and we estimate that unidentified common variants of similar effect sizes would increase this figure to approximately 16% of phenotypic variation (approximately 20% of heritable variation). Although additional approaches are needed to dissect the genetic architecture of polygenic human traits fully, our findings indicate that GWA studies can identify large numbers of loci that implicate biologically relevant genes and pathways.

1,751 citations

Journal ArticleDOI
TL;DR: A method for debiasing penalized regression adjustments to allow sparse regression methods like the lasso to be used for √n-consistent inference of average treatment effects in high-dimensional linear models.

Abstract: There are many settings where researchers are interested in estimating average treatment effects and are willing to rely on the unconfoundedness assumption, which requires that the treatment assignment be as good as random conditional on pretreatment variables. The unconfoundedness assumption is often more plausible if a large number of pretreatment variables are included in the analysis, but this can worsen the performance of standard approaches to treatment effect estimation. We develop a method for debiasing penalized regression adjustments to allow sparse regression methods like the lasso to be used for √n-consistent inference of average treatment effects in high-dimensional linear models. Given linearity, we do not need to assume that the treatment propensities are estimable, or that the average treatment effect is a sparse contrast of the outcome model parameters. Rather, in addition to standard assumptions used to make lasso regression on the outcome model consistent under l1-norm error, we require only overlap, i.e. that the propensity score be uniformly bounded away from 0 and 1. Procedurally, our method combines balancing weights with a regularized regression adjustment.
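A rough sketch of the outcome-regression-plus-weighted-residual-correction structure on simulated data, targeting the effect on the treated. For brevity, the weights below are normalized propensity-odds weights from a logistic lasso; the paper instead constructs approximately balancing weights directly, which this sketch does not reproduce.

```r
# Debiased regression adjustment under unconfoundedness, sketched.
library(glmnet)

set.seed(1)
n <- 600; p <- 100
X <- matrix(rnorm(n * p), n, p)
e <- 1 / (1 + exp(-X[, 1]))                      # true propensity
W <- rbinom(n, 1, e)
Y <- X[, 1] + X[, 2] + 2 * W + rnorm(n)          # effect on the treated = 2

fit0 <- cv.glmnet(X[W == 0, ], Y[W == 0])        # outcome model on controls
mu0 <- as.numeric(predict(fit0, X, s = "lambda.min"))

fitE <- cv.glmnet(X, W, family = "binomial")     # propensity model
ehat <- as.numeric(predict(fitE, X, s = "lambda.min", type = "response"))
gamma <- ehat / (1 - ehat)                       # odds weights for controls

tr <- W == 1; co <- W == 0
# regression adjustment for the treated, bias-corrected by weighted residuals
att_hat <- mean(Y[tr] - mu0[tr]) -
  sum(gamma[co] * (Y[co] - mu0[co])) / sum(gamma[co])
att_hat
```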

326 citations

Book ChapterDOI
01 Jan 2017
TL;DR: This chapter presents econometric and statistical methods for analyzing randomized experiments, and considers, in detail, estimation and inference for heterogeneous treatment effects in settings with (possibly many) covariates.

Abstract: In this chapter, we present econometric and statistical methods for analyzing randomized experiments. For basic experiments, we stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population. We show how this perspective relates to regression analyses for randomized experiments. We discuss the analyses of stratified, paired, and clustered randomized experiments, and we stress the general efficiency gains from stratification. We also discuss complications in randomized experiments such as noncompliance. In the presence of noncompliance, we contrast intention-to-treat analyses with instrumental variables analyses allowing for general treatment effect heterogeneity. We consider, in detail, estimation and inference for heterogeneous treatment effects in settings with (possibly many) covariates. These methods allow researchers to explore heterogeneity by identifying subpopulations with different treatment effects while maintaining the ability to construct valid confidence intervals. We also discuss optimal assignment to treatment based on covariates in such settings. Finally, we discuss estimation and inference in experiments in settings with interactions between units, both in general network settings and in settings where the population is partitioned into groups with all interactions contained within these groups.
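A minimal sketch of randomization-based inference on simulated data: under the sharp null of no effect for any unit, the difference in means is recomputed over re-drawn treatment assignments, so uncertainty comes from the design itself rather than from a hypothesized sampling model.

```r
# Fisher randomization test for a completely randomized experiment.
set.seed(1)
n <- 100
W <- sample(rep(c(0, 1), n / 2))                 # completely randomized design
Y <- 0.5 * W + rnorm(n)

stat <- function(w) mean(Y[w == 1]) - mean(Y[w == 0])
observed <- stat(W)

# re-draw assignments under the sharp null of no effect for any unit
null_stats <- replicate(10000, stat(sample(W)))
p_value <- mean(abs(null_stats) >= abs(observed))
p_value
```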

291 citations

Posted Content
TL;DR: This paper develops a general class of two-step algorithms for heterogeneous treatment effect estimation in observational studies that have a quasi-oracle property, and implements variants of this approach based on penalized regression, kernel ridge regression, and boosting, finding promising performance relative to existing baselines.

Abstract: Flexible estimation of heterogeneous treatment effects lies at the heart of many statistical challenges, such as personalized medicine and optimal resource allocation. In this paper, we develop a general class of two-step algorithms for heterogeneous treatment effect estimation in observational studies. We first estimate marginal effects and treatment propensities in order to form an objective function that isolates the causal component of the signal. Then, we optimize this data-adaptive objective function. Our approach has several advantages over existing methods. From a practical perspective, our method is flexible and easy to use: In both steps, we can use any loss-minimization method, e.g., penalized regression, deep neural networks, or boosting; moreover, these methods can be fine-tuned by cross-validation. Meanwhile, in the case of penalized kernel regression, we show that our method has a quasi-oracle property: Even if the pilot estimates for marginal effects and treatment propensities are not particularly accurate, we achieve the same error bounds as an oracle who has a priori knowledge of these two nuisance components. We implement variants of our approach based on penalized regression, kernel ridge regression, and boosting in a variety of simulation setups, and find promising performance relative to existing baselines.
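A minimal sketch of the two-step recipe on simulated data, using penalized regression throughout: cross-fit the marginal outcome model m(x) and propensity e(x), form the induced pseudo-outcomes and weights, and fit a linear effect model by weighted lasso. The simulation and variable names are illustrative; the paper develops the objective and its guarantees for general loss minimizers.

```r
# Two-step heterogeneous-effect estimation with cross-fitted nuisances.
library(glmnet)

set.seed(1)
n <- 1000; p <- 20
X <- matrix(rnorm(n * p), n, p)
e <- 1 / (1 + exp(-0.5 * X[, 1]))
W <- rbinom(n, 1, e)
tau <- 1 + X[, 2]                                # heterogeneous effect
Y <- X[, 1] + tau * W + rnorm(n)

K <- 5; fold <- sample(rep(1:K, length.out = n))
mhat <- ehat <- numeric(n)
for (k in 1:K) {
  tr <- fold != k; te <- fold == k
  mhat[te] <- predict(cv.glmnet(X[tr, ], Y[tr]), X[te, ], s = "lambda.min")
  ehat[te] <- predict(cv.glmnet(X[tr, ], W[tr], family = "binomial"),
                      X[te, ], s = "lambda.min", type = "response")
}

# weighted regression of pseudo-outcomes on X isolates the causal component
pseudo <- (Y - mhat) / (W - ehat)
wts <- (W - ehat)^2
tau_fit <- cv.glmnet(X, pseudo, weights = wts)
coef(tau_fit, s = "lambda.min")[1:3]             # intercept and first two slopes
```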

289 citations

Book ChapterDOI
TL;DR: In this paper, the authors present econometric and statistical methods for analyzing randomized experiments, stress the general efficiency gains from stratification, and contrast intention-to-treat analyses with instrumental variables analyses allowing for general treatment effect heterogeneity.

Abstract: In this chapter, we present econometric and statistical methods for analyzing randomized experiments. For basic experiments, we stress randomization-based inference as opposed to sampling-based inference. In randomization-based inference, uncertainty in estimates arises naturally from the random assignment of the treatments, rather than from hypothesized sampling from a large population. We show how this perspective relates to regression analyses for randomized experiments. We discuss the analyses of stratified, paired, and clustered randomized experiments, and we stress the general efficiency gains from stratification. We also discuss complications in randomized experiments such as noncompliance. In the presence of noncompliance, we contrast intention-to-treat analyses with instrumental variables analyses allowing for general treatment effect heterogeneity. We consider, in detail, estimation and inference for heterogeneous treatment effects in settings with (possibly many) covariates. These methods allow researchers to explore heterogeneity by identifying subpopulations with different treatment effects while maintaining the ability to construct valid confidence intervals. We also discuss optimal assignment to treatment based on covariates in such settings. Finally, we discuss estimation and inference in experiments in settings with interactions between units, both in general network settings and in settings where the population is partitioned into groups with all interactions contained within these groups.

268 citations