Author

Wei Pan

Bio: Wei Pan is an academic researcher from the University of Minnesota. The author has contributed to research on topics including Genome-wide association study and Feature selection. The author has an h-index of 51 and has co-authored 237 publications receiving 11,057 citations.


Papers
Journal Article (DOI)
Wei Pan
TL;DR: This work proposes a modification to AIC, where the likelihood is replaced by the quasi-likelihood and a proper adjustment is made for the penalty term.
Abstract: Correlated response data are common in biomedical studies. Regression analysis based on the generalized estimating equations (GEE) is an increasingly important method for such data. However, few model-selection criteria are available for GEE. The well-known Akaike Information Criterion (AIC) cannot be applied directly, since AIC is based on maximum likelihood estimation whereas GEE is not likelihood-based. We propose a modification to AIC in which the likelihood is replaced by the quasi-likelihood and a proper adjustment is made to the penalty term. Its performance is investigated through simulation studies. For illustration, the method is applied to a real data set.
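To make the shape of such a criterion concrete, here is a brief LaTeX sketch of an AIC-type quantity of the kind described above; the notation is assumed rather than quoted from the paper (Q is the quasi-likelihood evaluated under the independence working correlation I, \hat\Omega_I the model-based covariance estimate under independence, and \hat V_R the robust sandwich covariance estimate under working correlation R).

% Sketch (assumed notation) of an AIC-type criterion for GEE.
\mathrm{QIC}(R) \;=\; -2\,Q\bigl(\hat\beta(R);\, I\bigr)
                \;+\; 2\,\operatorname{trace}\bigl(\hat\Omega_I^{-1}\,\hat V_R\bigr)

When the working correlation structure is adequate, the trace term is close to the number of regression parameters p, so the penalty reduces to the familiar 2p of AIC.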

2,233 citations

Journal Article (DOI)
Wei Pan
TL;DR: All three methods are based on the two-sample t-statistic or a minor variation of it, but they differ in how they attach a statistical significance level to the statistic, which can lead to large differences in the resulting significance levels and in the numbers of genes detected.
Abstract: Motivation: A common task in analyzing microarray data is to determine which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Recently several statistical methods have been proposed to accomplish this goal when there are replicated samples under each condition. However, it may not be clear how these methods compare with each other. Our main goal here is to compare three methods, the t-test, a regression modeling approach (Thomas et al., Genome Res., 11, 1227-1236, 2001) and a mixture model approach (Pan et al., http://www.biostat.umn.edu/cgi-bin/rrs?print+2001,2001a,b), with particular attention to their different modeling assumptions. Results: All three methods are based on the two-sample t-statistic or a minor variation of it; however, they differ in how they attach a statistical significance level to the statistic, leading to possibly large differences in the resulting significance levels and in the numbers of genes detected. In particular, we give an explicit formula for the test statistic used in the regression approach. Using the leukemia data of Golub et al. (Science, 285, 531-537, 1999), we illustrate these points. We also briefly compare the results with those of several other methods, including the empirical Bayesian method of Efron et al. (J. Am. Stat. Assoc., to appear, 2001) and the Significance Analysis of Microarrays (SAM) method of Tusher et al. (Proc. Natl Acad. Sci. USA, 98, 5116-5121, 2001).
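As a concrete illustration of the point made in the abstract (a shared statistic, but different ways of attaching significance to it), here is a minimal Python sketch; it is an assumed example, not code from any of the compared papers. It computes per-gene two-sample t-statistics and then produces a significance level in two ways, from the parametric t reference and from a permutation null.

# Minimal sketch (assumed example): the same per-gene two-sample t-statistic,
# with two different ways of attaching a significance level to it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n1, n2 = 1000, 10, 10
x = rng.normal(size=(n_genes, n1))   # expression under condition 1
y = rng.normal(size=(n_genes, n2))   # expression under condition 2
y[:50] += 1.0                        # 50 truly differential genes

# Per-gene two-sample t-statistics and parametric p-values.
t_obs, p_param = stats.ttest_ind(x, y, axis=1)

# Permutation null: recompute the same statistic after shuffling sample labels.
data = np.hstack([x, y])
n_perm = 200
exceed = np.zeros(n_genes)
for _ in range(n_perm):
    perm = rng.permutation(n1 + n2)
    t_perm, _ = stats.ttest_ind(data[:, perm[:n1]], data[:, perm[n1:]], axis=1)
    exceed += np.abs(t_perm) >= np.abs(t_obs)
p_perm = (exceed + 1) / (n_perm + 1)

print("gene 0: parametric p =", p_param[0], ", permutation p =", p_perm[0])

The statistic is identical in both cases; only the reference distribution used to convert it into a significance level differs, which is exactly where the compared methods diverge.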

555 citations

Journal Article (DOI)
TL;DR: This article presents a powerful association test based on data-adaptive modifications to a so-called Sum test originally proposed for common variants, which aims to strike a balance between utilizing information on multiple markers in linkage disequilibrium and reducing the cost of large degrees of freedom or of multiple testing adjustment.
Abstract: Since associations between complex diseases and common variants are typically weak, and approaches to genotyping rare variants (e.g. by next-generation resequencing) multiply, there is an urgent demand to develop powerful association tests that are able to detect disease associations with both common and rare variants. In this article we present such a test. It is based on data-adaptive modifications to a so-called Sum test originally proposed for common variants, which aims to strike a balance between utilizing information on multiple markers in linkage disequilibrium and reducing the cost of large degrees of freedom or of multiple testing adjustment. When applied to multiple common or rare variants in a candidate region, the proposed test is easy to use with 1 degree of freedom and without the need for multiple testing adjustment. We show that the proposed test has high power across a wide range of scenarios with either common or rare variants, or both. In particular, in some situations the proposed test performs better than several commonly used methods.
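The following Python sketch illustrates a Sum-type 1-degree-of-freedom test with a data-adaptive sign flip, in the spirit of, but not necessarily identical to, the test described above; the function name, the exact adaptive rule, and the use of a simple t-test on the aggregated score are all illustrative assumptions.

# Hedged sketch of an adaptive Sum-type burden test (assumed details).
import numpy as np
from scipy import stats

def adaptive_sum_test(genotypes, phenotype, n_perm=1000, seed=0):
    # genotypes: (n_subjects, n_variants) minor-allele counts (0/1/2)
    # phenotype: (n_subjects,) binary disease indicator (0/1)
    rng = np.random.default_rng(seed)

    def one_df_stat(y):
        # Flip the coding of variants whose marginal association with y is
        # negative, then test the resulting single sum score (1 df).
        y_c = y - y.mean()
        signs = np.where(genotypes.T @ y_c < 0, -1.0, 1.0)
        score = (genotypes * signs).sum(axis=1)
        t, _ = stats.ttest_ind(score[y == 1], score[y == 0])
        return abs(t)

    t_obs = one_df_stat(phenotype)
    # Permute phenotypes so the adaptive sign selection is redone under the
    # null; this keeps the test valid despite the data-driven flipping step.
    t_null = np.array([one_df_stat(rng.permutation(phenotype))
                       for _ in range(n_perm)])
    return (np.sum(t_null >= t_obs) + 1) / (n_perm + 1)

Collapsing the re-coded variants into one score is what keeps the test at a single degree of freedom without a multiple-testing adjustment, while the permutation accounts for the adaptive step.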

308 citations

Journal Article
TL;DR: A penalized likelihood approach with an L1 penalty function is proposed, automatically realizing variable selection via thresholding and delivering a sparse solution in model-based clustering analysis with a common diagonal covariance matrix.
Abstract: Variable selection in clustering analysis is both challenging and important. In the context of model-based clustering analysis with a common diagonal covariance matrix, which is especially suitable for "high dimension, low sample size" settings, we propose a penalized likelihood approach with an L1 penalty function, automatically realizing variable selection via thresholding and delivering a sparse solution. We derive an EM algorithm to fit our proposed model, and propose a modified BIC as a model selection criterion to choose the number of components and the penalization parameter. A simulation study and an application to gene function prediction with gene expression profiles demonstrate the utility of our method.
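To show the thresholding mechanism concretely, here is a short Python sketch of the key M-step idea under the setting described above (standardized data, common diagonal covariance): an L1 penalty on the cluster means turns the mean update into soft thresholding, and a variable whose means are shrunk to zero in every cluster carries no clustering information. Function and variable names are illustrative, and the exact update in the paper may differ in details.

import numpy as np

def soft_threshold(x, lam):
    # Elementwise soft-thresholding operator.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def m_step_means(data, resp, sigma2, lam):
    # data: (n, p) standardized observations; resp: (n, K) E-step
    # responsibilities; sigma2: (p,) common diagonal variances; lam: penalty.
    nk = resp.sum(axis=0)                    # effective cluster sizes (K,)
    raw = (resp.T @ data) / nk[:, None]      # unpenalized mean updates (K, p)
    # Per-cluster, per-variable threshold scales with the noise level and
    # inversely with the effective cluster size.
    return soft_threshold(raw, lam * sigma2[None, :] / nk[:, None])

# A variable j is retained only if some cluster keeps a nonzero mean for it:
# selected = np.any(m_step_means(X, R, s2, lam) != 0, axis=0)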

307 citations

Journal Article (DOI)
TL;DR: Theoretically, it is shown that the constrained L0 likelihood and its computational surrogate are optimal in that they achieve feature selection consistency and sharp parameter estimation, under one necessary condition required for any method to be selection consistent and to achieve sharp parameter estimation.
Abstract: In high-dimensional data analysis, feature selection becomes one effective means for dimension reduction, which proceeds with parameter estimation. Concerning the accuracy of selection and estimation, we study nonconvex constrained and regularized likelihoods in the presence of nuisance parameters. Theoretically, we show that the constrained L0 likelihood and its computational surrogate are optimal in that they achieve feature selection consistency and sharp parameter estimation, under one necessary condition required for any method to be selection consistent and to achieve sharp parameter estimation. It permits up to exponentially many candidate features. Computationally, we develop difference convex methods to implement the computational surrogate through primal and dual subproblems. These results establish a central role of L0 constrained and regularized likelihoods in feature selection and parameter estimation involving selection. As applications of the general method and theory, we perform feature selection...
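The abstract does not spell out the computational surrogate; one standard choice in this line of work is a truncated L1 penalty, which is attractive precisely because it decomposes into a difference of two convex functions. A hedged LaTeX sketch, with the truncation level τ and the sparsity bound K as assumed symbols:

% Constrained L0 problem and a truncated-L1 surrogate (assumed notation).
\max_{\beta}\ \ell(\beta) \quad \text{subject to} \quad \|\beta\|_0 \le K,
\qquad
J_\tau(|\beta_j|) \;=\; \min\!\left(\frac{|\beta_j|}{\tau},\, 1\right)
  \;=\; \frac{1}{\tau}\Bigl(|\beta_j| - \max\bigl(|\beta_j| - \tau,\, 0\bigr)\Bigr)

Because both |\beta_j| and \max(|\beta_j| - \tau, 0) are convex, the surrogate is a difference of convex functions, which is what allows a difference convex algorithm to proceed through a sequence of convex subproblems.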

282 citations


Cited by
Journal Article (DOI)
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories. First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules. Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, hand-writing recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs. Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules. Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically. Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations

Journal Article
Fumio Tajima
30 Oct 1989 - Genomics
TL;DR: It is suggested that the natural selection against large insertion/deletion is so weak that a large amount of variation is maintained in a population.

11,521 citations

Journal Article (DOI)
TL;DR: This paper proposes parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples.
Abstract: Non-biological experimental variation or “batch effects” are commonly observed across multiple batches of microarray experiments, often rendering the task of combining data from these batches difficult. The ability to combine microarray data sets is advantageous to researchers to increase statistical power to detect biological phenomena from studies where logistical considerations restrict sample size or in studies that require the sequential hybridization of arrays. In general, it is inappropriate to combine data sets without adjusting for batch effects. Methods have been proposed to filter batch effects from data, but these are often complicated and require large batch sizes (>25) to implement. Because the majority of microarray studies are conducted using much smaller sample sizes, existing methods are not sufficient. We propose parametric and non-parametric empirical Bayes frameworks for adjusting data for batch effects that are robust to outliers in small sample sizes and perform comparably to existing methods for large samples. We illustrate our methods using two example data sets and show that our methods are justifiable, easy to apply, and useful in practice. Software for our method is freely available at: http://biosun1.harvard.edu/complab/batch/.
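To make the basic idea concrete, the Python sketch below performs a plain gene-wise location/scale batch adjustment; it is an assumed, simplified example and deliberately omits the empirical Bayes shrinkage across genes that is the paper's actual contribution and what makes the approach usable with small batches.

# Minimal sketch of a naive gene-wise location/scale batch adjustment.
# This is NOT the empirical Bayes method of the paper; that method shrinks
# the per-batch, per-gene estimates across genes before adjusting.
import numpy as np

def naive_batch_adjust(expr, batch):
    # expr: (n_genes, n_samples) expression matrix; batch: (n_samples,) labels.
    batch = np.asarray(batch)
    adjusted = expr.astype(float).copy()
    grand_mean = adjusted.mean(axis=1, keepdims=True)
    grand_std = adjusted.std(axis=1, ddof=1, keepdims=True)
    for b in np.unique(batch):
        cols = batch == b
        mu_b = adjusted[:, cols].mean(axis=1, keepdims=True)
        sd_b = adjusted[:, cols].std(axis=1, ddof=1, keepdims=True)
        # Standardize within the batch, then restore the overall location/scale.
        adjusted[:, cols] = (adjusted[:, cols] - mu_b) / sd_b * grand_std + grand_mean
    return adjusted

Replacing mu_b and sd_b with estimates that borrow strength across genes (the empirical Bayes step) is what stabilizes the adjustment when each batch contains only a handful of samples.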

6,319 citations

Journal Article (DOI)
TL;DR: The statistical update of March 5, 2019, prepared by the writing group chaired by Emelia J. Benjamin, MD, ScM, FAHA, on behalf of the American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee.
Abstract: Writing group roster: Emelia J. Benjamin, MD, ScM, FAHA (Chair); Paul Muntner, PhD, MHS, FAHA (Vice Chair); Connie W. Tsao, MD, MPH (Vice Chair Elect); Salim S. Virani, MD, PhD, FAHA (Chair Elect); and members Alvaro Alonso, Marcio S. Bittencourt, Clifton W. Callaway, April P. Carson, Alanna M. Chamberlain, Alexander R. Chang, Susan Cheng, Sandeep R. Das, Francesca N. Delling, Luc Djousse, Mitchell S.V. Elkind, Jane F. Ferguson, Myriam Fornage, Lori Chaffin Jordan, Sadiya S. Khan, Brett M. Kissela, Kristen L. Knutson, Tak W. Kwan, Daniel T. Lackland, Tené T. Lewis, Judith H. Lichtman, Chris T. Longenecker, Matthew Shane Loop, Pamela L. Lutsey, Seth S. Martin, Kunihiro Matsushita, Andrew E. Moran, Michael E. Mussolino, Martin O’Flaherty, Ambarish Pandey, Amanda M. Perak, Wayne D. Rosamond, Gregory A. Roth, Uchechukwu K.A. Sampson, Gary M. Satou, Emily B. Schroeder, Svati H. Shah, Nicole L. Spartano, Andrew Stokes, David L. Tirschwell, Mintu P. Turakhia, Lisa B. VanWagner, John T. Wilkins, and Sally S. Wong; on behalf of the American Heart Association Council on Epidemiology and Prevention Statistics Committee and Stroke Statistics Subcommittee.

5,739 citations

Journal Article (DOI)
TL;DR: The Statistical Update represents the most up-to-date statistics related to heart disease, stroke, and the cardiovascular risk factors listed in the AHA's My Life Check - Life’s Simple 7, which include core health behaviors and health factors that contribute to cardiovascular health.
Abstract: Each year, the American Heart Association (AHA), in conjunction with the Centers for Disease Control and Prevention, the National Institutes of Health, and other government agencies, brings together in a single document the most up-to-date statistics related to heart disease, stroke, and the cardiovascular risk factors listed in the AHA’s My Life Check - Life’s Simple 7 (Figure 1), which include core health behaviors (smoking, physical activity, diet, and weight) and health factors (cholesterol, blood pressure [BP], and glucose control) that contribute to cardiovascular health. The Statistical Update represents …

5,102 citations