
Showing papers in "Genetic Epidemiology in 2016"


Journal ArticleDOI
TL;DR: A novel weighted median estimator for combining data on multiple genetic variants into a single causal estimate is presented, which is consistent even when up to 50% of the information comes from invalid instrumental variables.
Abstract: Developments in genome-wide association studies and the increasing availability of summary genetic association data have made application of Mendelian randomization relatively straightforward. However, obtaining reliable results from a Mendelian randomization investigation remains problematic, as the conventional inverse-variance weighted method only gives consistent estimates if all of the genetic variants in the analysis are valid instrumental variables. We present a novel weighted median estimator for combining data on multiple genetic variants into a single causal estimate. This estimator is consistent even when up to 50% of the information comes from invalid instrumental variables. In a simulation analysis, it is shown to have better finite-sample Type 1 error rates than the inverse-variance weighted method, and is complementary to the recently proposed MR-Egger (Mendelian randomization-Egger) regression method. In analyses of the causal effects of low-density lipoprotein cholesterol and high-density lipoprotein cholesterol on coronary artery disease risk, the inverse-variance weighted method suggests a causal effect of both lipid fractions, whereas the weighted median and MR-Egger regression methods suggest a null effect of high-density lipoprotein cholesterol that corresponds with the experimental evidence. Both median-based and MR-Egger regression methods should be considered as sensitivity analyses for Mendelian randomization investigations with multiple genetic variants.
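The estimator's core computation is simple enough to sketch. A minimal Python illustration (not the authors' implementation; the inverse-variance weights use the usual first-order approximation for the variance of the per-variant Wald ratio):

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: the point where cumulative weight crosses 50%."""
    order = np.argsort(values)
    values = np.asarray(values, float)[order]
    weights = np.asarray(weights, float)[order]
    cum = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    return np.interp(0.5, cum, values)

def weighted_median_mr(beta_exposure, beta_outcome, se_outcome):
    """Combine per-variant Wald ratios into one causal estimate,
    weighting each variant by the inverse variance of its ratio."""
    beta_exposure = np.asarray(beta_exposure, float)
    ratio = np.asarray(beta_outcome, float) / beta_exposure
    weights = (beta_exposure / np.asarray(se_outcome, float)) ** 2
    return weighted_median(ratio, weights)
```

Because the median rather than the mean of the weighted ratio estimates is taken, a minority of variants with invalid (outlying) ratios cannot drag the estimate arbitrarily far, which is the source of the 50%-invalid-instrument robustness described above.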

2,959 citations


Journal ArticleDOI
TL;DR: This paper performs simulation studies to investigate the magnitude of bias and Type 1 error rate inflation arising from sample overlap and considers both a continuous outcome and a case‐control setting with a binary outcome.
Abstract: Mendelian randomization analyses are often performed using summarized data. The causal estimate from a one-sample analysis (in which data are taken from a single data source) with weak instrumental variables is biased in the direction of the observational association between the risk factor and outcome, whereas the estimate from a two-sample analysis (in which data on the risk factor and outcome are taken from non-overlapping datasets) is less biased and any bias is in the direction of the null. When using genetic consortia that have partially overlapping sets of participants, the direction and extent of bias are uncertain. In this paper, we perform simulation studies to investigate the magnitude of bias and Type 1 error rate inflation arising from sample overlap. We consider both a continuous outcome and a case-control setting with a binary outcome. For a continuous outcome, bias due to sample overlap is a linear function of the proportion of overlap between the samples. So, in the case of a null causal effect, if the relative bias of the one-sample instrumental variable estimate is 10% (corresponding to an F parameter of 10), then the relative bias with 50% sample overlap is 5%, and with 30% sample overlap is 3%. In a case-control setting, if risk factor measurements are only included for the control participants, unbiased estimates are obtained even in a one-sample setting. However, if risk factor data on both control and case participants are used, then bias is similar with a binary outcome as with a continuous outcome. Consortia releasing publicly available data on the associations of genetic variants with continuous risk factors should provide estimates that exclude case participants from case-control samples.
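The linear relationship described for a continuous outcome can be written as a one-line rule of thumb (an illustrative sketch of the result quoted above, not code from the paper):

```python
def relative_bias(overlap, f_stat):
    """Relative bias of the IV estimate toward the observational
    association under a null causal effect: the one-sample relative
    bias is roughly 1/F, and it scales linearly with the fraction of
    participants shared between the exposure and outcome samples."""
    return overlap * (1.0 / f_stat)

relative_bias(1.0, 10)  # one-sample, F = 10: 10% relative bias
relative_bias(0.5, 10)  # 50% sample overlap: 5%
relative_bias(0.3, 10)  # 30% sample overlap: 3%
```

These three calls reproduce the worked numbers in the abstract.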

768 citations


Journal ArticleDOI
TL;DR: Polygenic epidemiology has been used to establish a polygenic effect, estimate genetic correlation between traits, estimate how many variants affect a trait, stratify cases into subphenotypes, predict individual disease risks, and infer causal effects using Mendelian randomization.
Abstract: Much of the genetic basis of complex traits is present on current genotyping products, but the individual variants that affect the traits have largely not been identified. Several traditional problems in genetic epidemiology have recently been addressed by assuming a polygenic basis for disease and treating it as a single entity. Here I briefly review some of these applications, which collectively may be termed polygenic epidemiology. Methodologies in this area include polygenic scoring, linear mixed models, and linkage disequilibrium scoring. They have been used to establish a polygenic effect, estimate genetic correlation between traits, estimate how many variants affect a trait, stratify cases into subphenotypes, predict individual disease risks, and infer causal effects using Mendelian randomization. Polygenic epidemiology will continue to yield useful applications even while much of the specific variation underlying complex traits remains undiscovered.
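Of the methodologies listed, polygenic scoring is the simplest to make concrete: each individual's score is a weighted count of trait-associated alleles. A minimal sketch (illustrative only; real pipelines add LD clumping, P-value thresholding, and ancestry adjustment):

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """genotypes: (n_individuals, n_snps) counts of the effect allele
    (0/1/2); weights: per-SNP effect sizes from a discovery GWAS."""
    return np.asarray(genotypes, float) @ np.asarray(weights, float)
```

The resulting score treats the polygenic signal as a single entity, which is exactly what enables the downstream applications listed above (risk stratification, genetic correlation, Mendelian randomization).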

134 citations


Journal ArticleDOI
TL;DR: A comprehensive meta‐analysis of 19 studies was performed, and a general model allowing the integration of the different types of cancer risk available in the literature was developed, to obtain a consensus estimate of BC penetrance.
Abstract: The gene responsible for ataxia-telangiectasia syndrome, ATM, is also an intermediate-risk breast cancer (BC) susceptibility gene. Numerous studies have been carried out to determine the contribution of ATM gene mutations to BC risk. Epidemiological cohorts, segregation analyses, and case-control studies reported BC risk in different forms, including penetrance, relative risk, standardized incidence ratio, and odds ratio. Because the reported estimates vary both qualitatively and quantitatively, we developed a general model allowing the integration of the different types of cancer risk available in the literature. We performed a comprehensive meta-analysis identifying 19 studies, and used our model to obtain a consensus estimate of BC penetrance. We estimated the cumulative risk of BC in heterozygous ATM mutation carriers to be 6.02% by 50 years of age (95% credible interval: 4.58-7.42%) and 32.83% by 80 years of age (95% credible interval: 24.55-40.43%). An accurate assessment of cancer penetrance is crucial to help mutation carriers make medical and lifestyle decisions that can reduce their chances of developing the disease.

97 citations


Journal ArticleDOI
TL;DR: This paper revisits and untangles major theoretical aspects of interaction tests in the special case of linear regression, and explores the advantages and limitations of multivariate interaction models, when testing for interaction between multiple SNPs and/or multiple exposures, over univariate approaches.
Abstract: The identification of gene-gene and gene-environment interactions in human traits and diseases is an active area of research that generates high expectations, which are most often followed by disappointment. This is partly explained by a misunderstanding of the inherent characteristics of standard regression-based interaction analyses. Here, I revisit and untangle major theoretical aspects of interaction tests in the special case of linear regression; in particular, I discuss variable coding schemes, interpretation of effect estimates, statistical power, and estimation of variance explained with regard to various hypothetical interaction patterns. Linking these components, it appears first that the simplest biological interaction models-in which the magnitude of a genetic effect depends on a common exposure-are among the most difficult to identify. Second, I highlight the shortcomings of the current strategy to evaluate the contribution of interaction effects to the variance of quantitative outcomes and argue for the use of new approaches to overcome this issue. Finally, I explore the advantages and limitations of multivariate interaction models, when testing for interaction between multiple SNPs and/or multiple exposures, over univariate approaches. Together, these new insights can be leveraged for future method development and to improve our understanding of the genetic architecture of multifactorial traits.
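The standard regression-based interaction test discussed here can be made concrete with a small simulation (a generic sketch, not the author's code): a SNP whose effect exists only in the exposed group, tested via the product term of a linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
g = rng.binomial(2, 0.3, n)        # SNP genotype, additive coding
e = rng.binomial(1, 0.5, n)        # binary exposure
# genetic effect only in the exposed: a pure "G effect depends on E" pattern
y = 0.3 * g * e + rng.normal(size=n)

# design matrix: intercept, main effects, interaction term
X = np.column_stack([np.ones(n), g, e, g * e])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t_interaction = beta[3] / se[3]    # Wald test of the product term
```

Note that in this "effect only when exposed" pattern, the marginal genetic effect is diluted by the unexposed half of the sample, which is one way to see why the simplest biological interaction models are hard to detect at genome-wide significance.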

83 citations


Journal ArticleDOI
TL;DR: A new and scalable algorithm, joint analysis of marginal summary statistics (JAM), is described for the re-analysis of published marginal summary statistics under joint multi-SNP models; in realistic simulations it demonstrated performance identical to various alternatives designed for single-region settings.
Abstract: Recently, large scale genome-wide association study (GWAS) meta-analyses have boosted the number of known signals for some traits into the tens and hundreds. Typically, however, variants are only analysed one-at-a-time. This complicates the ability of fine-mapping to identify a small set of SNPs for further functional follow-up. We describe a new and scalable algorithm, joint analysis of marginal summary statistics (JAM), for the re-analysis of published marginal summary statistics under joint multi-SNP models. The correlation is accounted for according to estimates from a reference dataset, and models and SNPs that best explain the complete joint pattern of marginal effects are highlighted via an integrated Bayesian penalized regression framework. We provide both enumerated and Reversible Jump MCMC implementations of JAM and present some comparisons of performance. In a series of realistic simulation studies, JAM demonstrated identical performance to various alternatives designed for single region settings. In multi-region settings, where the only multivariate alternative involves stepwise selection, JAM offered greater power and specificity. We also present an application to real published results from MAGIC (meta-analysis of glucose and insulin related traits consortium) - a GWAS meta-analysis of more than 15,000 people. We re-analysed several genomic regions that produced multiple significant signals with glucose levels 2 hours after oral glucose stimulation. Through joint multivariate modelling, JAM was able to formally rule out many SNPs, and for one gene, ADCY5, suggests that an additional SNP, which transpired to be more biologically plausible, should be followed up with equal priority to the reported index.
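The central trick behind summary-statistic re-analysis methods like JAM can be sketched in a few lines: for standardized genotypes, joint multi-SNP effects can be recovered from marginal (one-at-a-time) effects via the LD correlation matrix, β_joint ≈ R⁻¹ β_marginal. The following is an illustrative simulation of that identity, not JAM itself (JAM adds a Bayesian penalized-regression layer and model search on top):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20000, 3
# SNP 1 correlated (r = 0.8) with causal SNP 0; SNP 2 independent
L = np.array([[1.0, 0.0, 0.0],
              [0.8, 0.6, 0.0],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(n, p)) @ L.T
X = (X - X.mean(0)) / X.std(0)          # standardize genotypes
y = 0.5 * X[:, 0] + rng.normal(size=n)  # only SNP 0 is causal

beta_marginal = X.T @ y / n             # one-at-a-time effect estimates
R = np.corrcoef(X, rowvar=False)        # LD from a "reference" panel
beta_joint = np.linalg.solve(R, beta_marginal)
```

Here the second SNP shows a sizeable marginal effect purely through LD with the causal SNP; the joint re-analysis shrinks it toward zero, which is how such methods can "formally rule out" SNPs as in the ADCY5 example.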

75 citations


Journal ArticleDOI
TL;DR: The study of gene‐environment interactions (G×E) has been an active area of research, but little is reported about the known findings in the literature.
Abstract: Background Risk of cancer is determined by a complex interplay of genetic and environmental factors. Although the study of gene-environment interactions (G×E) has been an active area of research, little is reported about the known findings in the literature. Methods To examine the state of the science in G×E research in cancer, we performed a systematic review of published literature using gene-environment or pharmacogenomic flags from two curated databases of genetic association studies, the Human Genome Epidemiology (HuGE) literature finder and Cancer Genome-Wide Association and Meta Analyses Database (CancerGAMAdb), from January 1, 2001, to January 31, 2011. A supplemental search using HuGE was conducted for articles published from February 1, 2011, to April 11, 2013. A 25% sample of the supplemental publications was reviewed. Results A total of 3,019 articles were identified in the original search. From these articles, 243 were determined to be relevant based on inclusion criteria (more than 3,500 interactions). From the supplemental search (1,400 articles identified), 29 additional relevant articles (1,370 interactions) were included. The majority of publications in both searches examined G×E in colon, rectal, or colorectal; breast; or lung cancer. Specific interactions examined most frequently involved environmental factors categorized as energy balance (e.g., body mass index, diet), exogenous (e.g., oral contraceptives) and endogenous hormones (e.g., menopausal status), chemical environment (e.g., grilled meats), and lifestyle (e.g., smoking, alcohol intake). In both searches, the majority of interactions examined used loci from candidate gene studies, and none of the studies were genome-wide interaction studies (GEWIS). The most commonly reported measure was the interaction P-value, of which a sizable number were considered statistically significant (i.e., <0.05). In addition, the magnitude of the interactions reported was modest. Conclusion Observations of the published literature suggest that opportunity exists for increased sample size in G×E research, for including GWAS-identified loci in G×E studies, for exploring more GWAS approaches in G×E such as GEWIS, and for improving the reporting of G×E findings.

75 citations


Journal ArticleDOI
TL;DR: In this article, a simple hierarchical testing procedure was proposed to control the false discovery rate and the expected value of the average proportion of false discovery of phenotypes influenced by such variants.
Abstract: The genetic basis of multiple phenotypes such as gene expression, metabolite levels, or imaging features is often investigated by testing a large collection of hypotheses, probing the existence of association between each of the traits and hundreds of thousands of genotyped variants. Appropriate multiplicity adjustment is crucial to guarantee replicability of findings, and the false discovery rate (FDR) is frequently adopted as a measure of global error. In the interest of interpretability, results are often summarized so that reporting focuses on variants discovered to be associated to some phenotypes. We show that applying FDR-controlling procedures on the entire collection of hypotheses fails to control the rate of false discovery of associated variants as well as the expected value of the average proportion of false discovery of phenotypes influenced by such variants. We propose a simple hierarchical testing procedure that allows control of both these error rates and provides a more reliable basis for the identification of variants with functional effects. We demonstrate the utility of this approach through simulation studies comparing various error rates and measures of power for genetic association studies of multiple traits. Finally, we apply the proposed method to identify genetic variants that impact flowering phenotypes in Arabidopsis thaliana, expanding the set of discoveries.
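The flavor of the proposed procedure can be sketched as follows (an illustrative two-level scheme in the spirit of the paper, not its exact algorithm): combine each variant's phenotype P-values into a single variant-level P-value, then apply an FDR-controlling step across variants.

```python
import numpy as np

def bh(pvals, q):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, bool)
    mask[order[:k]] = True
    return mask

def simes(pvals):
    """Simes combination: global P-value for 'this variant affects
    at least one phenotype'."""
    p = np.sort(np.asarray(pvals, float))
    m = len(p)
    return float(np.min(m * p / np.arange(1, m + 1)))

# p[i, j]: P-value for variant i versus phenotype j
p = np.array([[0.001, 0.30, 0.02],
              [0.40,  0.50, 0.60],
              [0.004, 0.01, 0.80]])
variant_p = np.array([simes(row) for row in p])
selected = bh(variant_p, q=0.05)   # level 1: which variants are associated at all
```

In the full hierarchical procedure, phenotypes are then tested within each selected variant at a level adjusted for the number of variants selected, which is what yields simultaneous control of both error rates described above.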

67 citations


Journal ArticleDOI
TL;DR: The Mendelian Randomization analysis provides new evidence for a causal role of educational attainment on refractive error and suggests that observational studies may actually underestimate the true effect of education.
Abstract: Myopia is the largest cause of uncorrected visual impairments globally and its recent dramatic increase in the population has made it a major public health problem. In observational studies, educational attainment has been consistently reported to be correlated with myopia. Nonetheless, correlation does not imply causation. Observational studies do not tell us whether education causes myopia or whether confounding factors underlie the association. In this work, we use a two-stage least squares instrumental-variable (IV) approach to estimate the causal effect of education on refractive error, specifically myopia. We used the results from the educational attainment GWAS from the Social Science Genetic Association Consortium to define a polygenic risk score (PGRS) in three cohorts of late middle age and elderly Caucasian individuals (N = 5,649). In a meta-analysis of the three cohorts, using the PGRS as an IV, we estimated that each z-score increase in education (approximately 2 years of education) results in a reduction of 0.92 ± 0.29 diopters (P = 1.04 × 10⁻³). Our estimate of the effect of education on myopia was higher (P = 0.01) than the observed estimate (0.25 ± 0.03 diopters reduction per education z-score [∼2 years] increase). This suggests that observational studies may actually underestimate the true effect. Our Mendelian Randomization (MR) analysis provides new evidence for a causal role of educational attainment on refractive error.
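The IV logic can be sketched with a toy simulation (generic effect sizes, not the paper's data): a confounder distorts the observational estimate, while the two-stage procedure recovers the true causal effect because the genetic instrument is independent of the confounder.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
u = rng.normal(size=n)                    # unmeasured confounder
prs = rng.normal(size=n)                  # polygenic score: the instrument
education = 0.3 * prs + u + rng.normal(size=n)
outcome = 0.5 * education - u + rng.normal(size=n)  # true causal effect: 0.5

# stage 1: predict the exposure from the instrument
stage1 = np.polyfit(prs, education, 1)
education_hat = np.polyval(stage1, prs)
# stage 2: regress the outcome on the predicted exposure
iv_slope = np.polyfit(education_hat, outcome, 1)[0]

naive = np.polyfit(education, outcome, 1)[0]  # confounded OLS estimate
```

In this toy example the confounder acts against the causal effect, so the naive regression badly underestimates it while the IV slope is close to 0.5, mirroring the paper's finding that observational studies may underestimate the effect of education.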

61 citations


Journal ArticleDOI
TL;DR: It is shown that MANOVA is generally very powerful for detecting association but there are situations, such as when a genetic variant is associated with all the traits, where MANOVA may not have any detection power, and a unified score‐based test statistic USAT is proposed that can perform better than MANOVA in such situations and nearly as well as MANOVA elsewhere.
Abstract: Genome-wide association studies (GWASs) for complex diseases often collect data on multiple correlated endo-phenotypes. Multivariate analysis of these correlated phenotypes can improve the power to detect genetic variants. Multivariate analysis of variance (MANOVA) can perform such association analysis at a GWAS level, but the behavior of MANOVA under different trait models has not been carefully investigated. In this paper, we show that MANOVA is generally very powerful for detecting association but there are situations, such as when a genetic variant is associated with all the traits, where MANOVA may not have any detection power. In these situations, marginal model based methods, however, perform much better than multivariate methods. We investigate the behavior of MANOVA, both theoretically and using simulations, and derive the conditions where MANOVA loses power. Based on our findings, we propose a unified score-based test statistic USAT that can perform better than MANOVA in such situations and nearly as well as MANOVA elsewhere. Our proposed test reports an approximate asymptotic P-value for association and is computationally very efficient to implement at a GWAS level. We have studied through extensive simulations the performance of USAT, MANOVA, and other existing approaches and demonstrated the advantage of using the USAT approach to detect association between a genetic variant and multivariate phenotypes. We applied USAT to data from three correlated traits collected on 5,816 Caucasian individuals from the Atherosclerosis Risk in Communities (ARIC) Study and detected some interesting associations.

44 citations


Journal ArticleDOI
TL;DR: This work derives an exact test for KAT with continuous traits, which resolves the small-sample conservatism of KAT without the need for resampling, and proposes a similar approximate test for binary traits that has significantly improved power to detect association for microbiome studies.
Abstract: Kernel machine based association tests (KAT) have been increasingly used in testing the association between an outcome and a set of biological measurements due to their power to combine multiple weak signals of complex relationship with the outcome through the specification of a relevant kernel. Human genetic and microbiome association studies are two important applications of KAT. However, the classic KAT framework relies on large sample theory, and conservativeness has been observed for small sample studies, especially for microbiome association studies. The common approach for addressing the small sample problem relies on computationally intensive resampling methods. Here, we derive an exact test for KAT with continuous traits, which resolves the small sample conservatism of KAT without the need for resampling. The exact test has significantly improved power to detect association for microbiome studies. For binary traits, we propose a similar approximate test, and we show that the approximate test is very powerful for a wide range of kernels including common variant- and microbiome-based kernels, and the approximate test controls the type I error well for these kernels. In contrast, the sequence kernel association tests have slightly inflated genomic inflation factors after small sample adjustment. Extensive simulations and application to a real microbiome association study are used to demonstrate the utility of our method.

Journal ArticleDOI
TL;DR: The commonly used sequence kernel association test (SKAT) for single‐trait analysis is extended to test for the joint association of rare variant sets with multiple traits to identify an exome‐wide significant rare variant set in the gene YAP1 worthy of further investigations.
Abstract: Genetic studies often collect multiple correlated traits, which could be analyzed jointly to increase power by aggregating multiple weak effects and provide additional insights into the etiology of complex human diseases. Existing methods for multiple trait association tests have primarily focused on common variants. There is a surprising dearth of published methods for testing the association of rare variants with multiple correlated traits. In this paper, we extend the commonly used sequence kernel association test (SKAT) for single-trait analysis to test for the joint association of rare variant sets with multiple traits. We investigate the performance of the proposed method through extensive simulation studies. We further illustrate its usefulness with application to the analysis of diabetes-related traits in the Atherosclerosis Risk in Communities (ARIC) Study. We identified an exome-wide significant rare variant set in the gene YAP1 worthy of further investigations.

Journal ArticleDOI
TL;DR: The results provide insights on the age span during which myopia genes exert their effect, and form the basis for understanding the mechanisms underlying high and pathological myopia.
Abstract: Previous studies have identified many genetic loci for refractive error and myopia. We aimed to investigate the effect of these loci on ocular biometry as a function of age in children, adolescents, and adults. The study population consisted of three age groups identified from the international CREAM consortium: 5,490 individuals aged <10 years, plus groups aged 10–25 years and >25 years. All participants had undergone standard ophthalmic examination including measurements of axial length (AL) and corneal radius (CR). We examined the lead SNP at all 39 currently known genetic loci for refractive error identified from genome-wide association studies (GWAS), as well as a combined genetic risk score (GRS). The beta coefficient for association between SNP genotype or GRS versus AL/CR was compared across the three age groups, adjusting for age, sex, and principal components. Analyses were Bonferroni-corrected. In the age group <10 years, three loci (GJD2, CHRNG, ZIC2) were associated with AL/CR. In the age group 10–25 years, four loci (BMP2, KCNQ5, A2BP1, CACNA1D) were associated; and in adults, 20 loci were associated. Association with the GRS increased with age; β = 0.0016 per risk allele (P = 2 × 10⁻⁸) in those aged <10 years, 0.0033 (P = 5 × 10⁻¹⁵) in 10- to 25-year-olds, and 0.0048 (P = 1 × 10⁻⁷²) in adults. Genes with the strongest effects (LAMA2, GJD2) had an early effect that increased with age. Our results provide insights into the age span during which myopia genes exert their effect. These insights form the basis for understanding the mechanisms underlying high and pathological myopia.

Journal ArticleDOI
TL;DR: The goal was to identify candidate genes with rare genetic variants for NSCLP in a Honduran population using whole exome sequencing; preliminary results identified 3,727 heterozygous rare variants, of which 1,282 were predicted to be functionally consequential.
Abstract: Studies suggest that nonsyndromic cleft lip and palate (NSCLP) is polygenic with variable penetrance, presenting a challenge in identifying all causal genetic variants. Despite relatively high prevalence of NSCLP among Amerindian populations, no large whole exome sequencing (WES) studies have been completed in this population. Our goal was to identify candidate genes with rare genetic variants for NSCLP in a Honduran population using WES. WES was performed on two to four members of 27 multiplex Honduran families. Genetic variants with a minor allele frequency > 1% in reference databases were removed. Heterozygous variants consistent with dominant disease with incomplete penetrance were ascertained, and variants with predicted functional consequence were prioritized for analysis. Pedigree-specific P-values were calculated as the probability of all affected members in the pedigree being carriers, given that at least one is a carrier. Preliminary results identified 3,727 heterozygous rare variants; 1,282 were predicted to be functionally consequential. Twenty-three genes had variants of interest in ≥3 families, where some genes had different variants in each family, giving a total of 50 variants. Variant validation via Sanger sequencing of the families and unrelated unaffected controls excluded variants that were sequencing errors or common variants not in databases, leaving four genes with candidate variants in ≥3 families. Of these, candidate variants in two genes consistently segregate with NSCLP as a dominant variant with incomplete penetrance: ACSS2 and PHYH. Rare variants found at the same gene in all affected individuals in several families are likely to be directly related to NSCLP.
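The pedigree-specific probability described can be written down directly under a simplifying assumption of independent carrier status across affected members (the paper's calculation conditions on the actual pedigree structure, so treat this as a sketch of the idea only):

```python
def pedigree_p(carrier_freq, n_affected):
    """P(all affected members carry the variant | at least one does),
    assuming carrier status is independent across members under the
    null -- a simplification that ignores within-family transmission."""
    q, k = carrier_freq, n_affected
    return q ** k / (1.0 - (1.0 - q) ** k)
```

Two sanity properties follow immediately: with a single affected member the probability is 1 by construction, and the probability shrinks as more affected members must all carry a rare variant by chance, making co-segregation in several families increasingly unlikely under the null.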

Journal ArticleDOI
TL;DR: This study demonstrates that principal components generally result in higher heritability and linkage evidence than individual traits, and PCHs can provide useful traits for using data on multiple phenotypes and for genetic studies of trans‐ethnic populations.
Abstract: A disease trait often can be characterized by multiple phenotypic measurements that can provide complementary information on disease etiology, physiology, or clinical manifestations. Given that multiple phenotypes may be correlated and reflect common underlying genetic mechanisms, the use of multivariate analysis of multiple traits may improve statistical power to detect genes and variants underlying complex traits. The literature, however, has been unclear as to the optimal approach for analyzing multiple correlated traits. In this study, heritability and linkage analysis was performed for six obstructive sleep apnea hypopnea syndrome (OSAHS) related phenotypes, as well as principal components of the phenotypes and principal components of the heritability (PCHs) using the data from Cleveland Family Study, which include both African and European American families. Our study demonstrates that principal components generally result in higher heritability and linkage evidence than individual traits. Furthermore, the PCHs can be transferred across populations, strongly suggesting that these PCHs reflect traits with common underlying genetic mechanisms for OSAHS across populations. Thus, PCHs can provide useful traits for using data on multiple phenotypes and for genetic studies of trans-ethnic populations.
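Principal components of correlated phenotypes can be computed in a few lines (a generic illustration with six simulated traits sharing one latent factor, standing in for the six OSAHS-related phenotypes; principal components of heritability additionally require decomposing the genetic covariance, which is beyond this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
shared = rng.normal(size=n)   # latent factor common to all traits
traits = np.column_stack(
    [shared + rng.normal(scale=0.5, size=n) for _ in range(6)]
)

Z = (traits - traits.mean(0)) / traits.std(0)  # standardize each phenotype
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
pc1 = Z @ eigvecs[:, -1]                       # first principal component
var_explained = eigvals[-1] / eigvals.sum()
```

When traits share an underlying factor, PC1 concentrates that shared signal into a single composite trait, which is consistent with the higher heritability and linkage evidence reported for principal components above.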

Journal ArticleDOI
TL;DR: It is shown that the Paré et al. approach has an inflated false‐positive rate in the presence of an environmental marginal effect, and an alternative that remains valid is proposed, and a novel 2‐step approach that combines the two screening approaches is proposed that can outperform other GWIS approaches.
Abstract: A genome-wide association study (GWAS) typically is focused on detecting marginal genetic effects. However, many complex traits are likely to be the result of the interplay of genes and environmental factors. SNPs involved in such interplay may have weak marginal effects and are thus unlikely to be detected by a scan of marginal effects, but may be detectable in a gene-environment (G × E) interaction analysis. However, a genome-wide interaction scan (GWIS) using a standard test of G × E interaction is known to have low power, particularly when one corrects for testing multiple SNPs. Two 2-step methods for GWIS have been previously proposed, aimed at improving efficiency by prioritizing SNPs most likely to be involved in a G × E interaction using a screening step. For a quantitative trait, these include a method that screens on marginal effects [Kooperberg and Leblanc, 2008] and a method that screens on variance heterogeneity by genotype [Paré et al., 2010]. In this paper, we show that the Paré et al. approach has an inflated false-positive rate in the presence of an environmental marginal effect, and we propose an alternative that remains valid. We also propose a novel 2-step approach that combines the two screening approaches, and provide simulations demonstrating that the new method can outperform other GWIS approaches. Application of this method to a G × Hispanic-ethnicity scan for childhood lung function reveals a SNP near the MARCO locus that was not identified by previous marginal-effect scans.
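The general shape of a 2-step GWIS can be sketched as follows (illustrative thresholds and z-score inputs; the methods compared in the paper differ in what the screening statistic is: marginal effects, variance heterogeneity, or the combination proposed here):

```python
import numpy as np
from statistics import NormalDist

def two_step_gwis(z_marginal, z_interaction, alpha1=0.05, alpha=0.05):
    """Step 1: screen SNPs on marginal-effect z-scores at a liberal
    level alpha1.  Step 2: test the G x E interaction only for SNPs
    that pass, with Bonferroni over the (much smaller) screened set."""
    nd = NormalDist()
    z_marginal = np.abs(np.asarray(z_marginal, float))
    z_interaction = np.abs(np.asarray(z_interaction, float))
    passed = z_marginal > nd.inv_cdf(1 - alpha1 / 2)   # step-1 screen
    k = int(passed.sum())
    if k == 0:
        return np.zeros(len(z_marginal), bool)
    test_cut = nd.inv_cdf(1 - alpha / (2 * k))         # Bonferroni over k
    return passed & (z_interaction > test_cut)
```

The correction in step 2 is over only the screened subset, which is where the power gain over a brute-force genome-wide interaction scan comes from; the second SNP in the test below also shows the screen's blind spot: a strong interaction with no marginal signal is never tested.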

Journal ArticleDOI
TL;DR: In this article, the authors compare the two frameworks using results from genome-wide association studies of systolic blood pressure for 3.2 million low frequency and 6.5 million common variants across 20 cohorts of European ancestry, comprising 79,731 individuals.
Abstract: Studying gene-environment (G × E) interactions is important, as they extend our knowledge of the genetic architecture of complex traits and may help to identify novel variants not detected via analysis of main effects alone. The main statistical framework for studying G × E interactions uses a single regression model that includes both the genetic main and G × E interaction effects (the “joint” framework). The alternative “stratified” framework combines results from genetic main-effect analyses carried out separately within the exposed and unexposed groups. Although there have been several investigations using theory and simulation, an empirical comparison of the two frameworks is lacking. Here, we compare the two frameworks using results from genome-wide association studies of systolic blood pressure for 3.2 million low frequency and 6.5 million common variants across 20 cohorts of European ancestry, comprising 79,731 individuals. Our cohorts have sample sizes ranging from 456 to 22,983 and include both family-based and population-based samples. In cohort-specific analyses, the two frameworks provided similar inference for population-based cohorts. The agreement was reduced for family-based cohorts. In meta-analyses, agreement between the two frameworks was less than that observed in cohort-specific analyses, despite the increased sample size. In meta-analyses, agreement depended on (1) the minor allele frequency, (2) inclusion of family-based cohorts in meta-analysis, and (3) filtering scheme. The stratified framework appears to approximate the joint framework well only for common variants in population-based cohorts. We conclude that the joint framework is the preferred approach and should be used to control false positives when dealing with low-frequency variants and/or family-based cohorts.
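The two frameworks can be compared directly in a toy simulation (illustrative; it ignores the family structure, meta-analysis, and filtering issues that drive the disagreements reported above):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20000
g = rng.binomial(2, 0.3, n).astype(float)  # genotype
e = rng.binomial(1, 0.5, n)                # exposed / unexposed
y = 0.2 * g + 0.4 * g * e + rng.normal(size=n)

# "joint" framework: one model with genetic main and interaction effects
X = np.column_stack([np.ones(n), g, e, g * e])
interaction_joint = np.linalg.lstsq(X, y, rcond=None)[0][3]

# "stratified" framework: genetic effect per stratum, then the difference
def slope(x, yv):
    return np.polyfit(x, yv, 1)[0]

interaction_strat = slope(g[e == 1], y[e == 1]) - slope(g[e == 0], y[e == 0])
```

With a binary exposure and no further covariates, the joint model is a reparametrization of the two stratum-specific fits, so the two interaction estimates coincide exactly; the paper's point is that this agreement degrades for family-based cohorts, low-frequency variants, and meta-analysis.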

Journal ArticleDOI
TL;DR: This contribution approaches omics and non-omics data integration from the scope of epidemiology by considering the “massive” inclusion of variables in risk assessment and predictive models.
Abstract: Primary and secondary prevention can benefit greatly from a personalized medicine approach through the accurate discrimination of individuals at high risk of developing a specific disease from those at moderate and low risk. To this end, precise risk prediction models need to be built. This endeavor requires a precise characterization of the individual exposome, genome, and phenome. Massive molecular omics data representing the different layers of the biological processes of the host and the nonhost will enable more accurate risk prediction models to be built. Epidemiologists aim to integrate omics data along with important information coming from other sources (questionnaires, candidate markers) that has proved relevant in the discrimination risk assessment of complex diseases. However, the integrative models in large-scale epidemiologic research are still in their infancy and they face numerous challenges, some of them at the analytical stage. So far, only a small number of studies have integrated more than two omics data sets, and the inclusion of non-omics data in the same models is still missing in most studies. In this contribution, we approach omics and non-omics data integration from the scope of epidemiology by considering the "massive" inclusion of variables in risk assessment and predictive models. We also provide already available examples of integrative contributions in the field, propose analytical strategies that allow considering both omics and non-omics data in the models, and finally review the challenges embedded in this type of research.

Journal ArticleDOI
TL;DR: The ability of the approach to identify biologically plausible SNP‐education interactions relative to Alzheimer's disease status using genome‐wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) is demonstrated.
Abstract: Although gene-environment (G × E) interactions play an important role in many biological systems, detecting these interactions within genome-wide data can be challenging due to the loss in statistical power incurred by multiple hypothesis correction. To address the challenge of poor power and the limitations of existing multistage methods, we recently developed a screening-testing approach for G × E interaction detection that combines elastic net penalized regression with joint estimation to support a single omnibus test for the presence of G × E interactions. In our original work on this technique, however, we did not assess type I error control or power and evaluated the method using just a single, small bladder cancer data set. In this paper, we extend the original method in two important directions and provide a more rigorous performance evaluation. First, we introduce a hierarchical false discovery rate approach to formally assess the significance of individual G × E interactions. Second, to support the analysis of truly genome-wide data sets, we incorporate a score statistic-based prescreening step to reduce the number of single nucleotide polymorphisms prior to fitting the first stage penalized regression model. To assess the statistical properties of our method, we compare the type I error rate and statistical power of our approach with competing techniques using both simple simulation designs as well as designs based on real disease architectures. Finally, we demonstrate the ability of our approach to identify biologically plausible SNP-education interactions relative to Alzheimer's disease status using genome-wide association study data from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
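The screening idea can be sketched with a generic elastic net fit that retains a sparse set of candidate terms for a second-stage formal test. This is a minimal illustration under assumed data, not the authors' screening-testing pipeline; the penalty settings and feature layout are hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
n, p = 500, 50

# Hypothetical feature matrix: in a G x E screen, SNP dosage columns and
# their products with an exposure would be stacked here; plain standardized
# features keep the sketch short. Features 0 and 1 carry the true signal.
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0.0, 0.5, n)

# Elastic net (combined L1 + L2 penalty) shrinks most null coefficients
# to exactly zero, leaving a small candidate set for downstream testing.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)
```

With signal this strong, the true features survive screening while most noise features are zeroed out; in practice the retained set is then passed to an omnibus or hierarchical FDR test rather than interpreted directly.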

Journal ArticleDOI
TL;DR: Simulations show that MetaCor controls inflation better than alternatives such as ignoring the correlation between the strata or analyzing all strata together in a “pooled” GWAS, especially with different minor allele frequencies (MAFs) between strata.
Abstract: Investigators often meta-analyze multiple genome-wide association studies (GWASs) to increase the power to detect associations of single nucleotide polymorphisms (SNPs) with a trait. Meta-analysis is also performed within a single cohort that is stratified by, e.g., sex or ancestry group. Having correlated individuals among the strata may complicate meta-analyses, limit power, and inflate Type 1 error. For example, in the Hispanic Community Health Study/Study of Latinos (HCHS/SOL), sources of correlation include genetic relatedness, shared household, and shared community. We propose a novel mixed-effect model for meta-analysis, "MetaCor," which accounts for correlation between stratum-specific effect estimates. Simulations show that MetaCor controls inflation better than alternatives such as ignoring the correlation between the strata or analyzing all strata together in a "pooled" GWAS, especially with different minor allele frequencies (MAFs) between strata. We illustrate the benefits of MetaCor on two GWASs in the HCHS/SOL. Analysis of dental caries (tooth decay) stratified by ancestry group detected a genome-wide significant SNP (rs7791001, P-value = 3.66×10-8, compared to 4.67×10-7 in pooled), with different MAFs between strata. Stratified analysis of body mass index (BMI) by ancestry group and sex reduced overall inflation from λGC=1.050 (pooled) to λGC=1.028 (MetaCor). Furthermore, even after removing close relatives to obtain nearly uncorrelated strata, a naive stratified analysis resulted in λGC=1.058 compared to λGC=1.027 for MetaCor.
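The baseline that a method like MetaCor generalizes is the fixed-effect inverse-variance weighted combination of stratum estimates; a GLS variant that admits a covariance between strata conveys the general idea of accounting for correlation. This is a sketch only (not the published MetaCor model), and the stratum effects below are hypothetical.

```python
import numpy as np

def ivw_meta(betas, ses):
    """Standard fixed-effect inverse-variance weighted meta-analysis,
    which assumes the stratum-specific estimates are independent."""
    b = np.asarray(betas, float)
    w = 1.0 / np.asarray(ses, float) ** 2
    return np.sum(w * b) / np.sum(w), np.sqrt(1.0 / np.sum(w))

def correlated_meta(betas, cov):
    """Generalized least-squares combination using a full covariance
    matrix of the stratum estimates -- the broad idea behind accounting
    for correlated strata (relatedness, shared household/community)."""
    b = np.asarray(betas, float)
    ci = np.linalg.inv(np.asarray(cov, float))
    one = np.ones_like(b)
    var = 1.0 / (one @ ci @ one)
    return var * (one @ ci @ b), np.sqrt(var)

# Hypothetical stratum-specific SNP effects for three ancestry groups.
beta, se = ivw_meta([0.12, 0.08, 0.15], [0.05, 0.04, 0.06])
```

With a diagonal covariance matrix the two functions return identical answers; positive cross-stratum covariance makes the naive IVW standard error anti-conservative, which is the inflation the abstract describes.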

Journal ArticleDOI
TL;DR: In the application to a large GWAS, it is found that the modified B–H procedure also performs well, indicating that this may be an optimal approach for determining the traits underlying a pleiotropic signal.
Abstract: Discovering pleiotropic loci is important to understand the biological basis of seemingly distinct phenotypes. Most methods for assessing pleiotropy only test for the overall association between genetic variants and multiple phenotypes. To determine which specific traits are pleiotropic, we evaluate via simulation and application three different strategies. The first is model selection techniques based on the inverse regression of genotype on phenotypes. The second is a subset-based meta-analysis, ASSET [Bhattacharjee et al., 2012], which provides an optimal subset of nonnull traits. The third is a modified Benjamini–Hochberg (B–H) procedure for controlling the expected false discovery rate [Benjamini and Hochberg, 1995] in the framework of a phenome-wide association study. From our simulations we see that the inverse regression-based approach MultiPhen [O'Reilly et al., 2012] is more powerful than ASSET for detecting overall pleiotropic association, except when all the phenotypes are associated and have genetic effects in the same direction. For determining which specific traits are pleiotropic, the modified B–H procedure performs consistently better than the other two methods. The inverse regression-based selection methods perform competitively with the modified B–H procedure only when the phenotypes are weakly correlated. The efficiency of ASSET is observed to lie below and in between the efficiency of the other two methods when the traits are weakly and strongly correlated, respectively. In our application to a large GWAS, we find that the modified B–H procedure also performs well, indicating that this may be an optimal approach for determining the traits underlying a pleiotropic signal.
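The standard (unmodified) Benjamini–Hochberg step-up procedure referenced above can be sketched as follows; the paper's PheWAS-specific modification is not reproduced here.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg (1995) step-up procedure: returns a boolean
    array marking which hypotheses are rejected at FDR level alpha."""
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * alpha, then reject the
    # hypotheses with the k smallest P-values.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

For example, with P-values (0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205) and alpha = 0.05, only the two smallest survive, since the third-ranked 0.039 exceeds its threshold of (3/8) × 0.05 = 0.01875.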

Journal ArticleDOI
TL;DR: This work proposes both an empirical method to estimate this prior variance, and a coherent approach to using SNP‐level functional data, to inform the prior probability of causal association, and shows that assigning SNP‐specific prior probabilities of association based on expert prior functional knowledge of the disease mechanism can lead to improved causal SNPs ranks.
Abstract: There is a large amount of functional genetic data available, which can be used to inform fine-mapping association studies (in diseases with well-characterised disease pathways). Single nucleotide polymorphism (SNP) prioritization via Bayes factors is attractive because prior information can inform the effect size or the prior probability of causal association. This approach requires the specification of the effect size. If the information needed to estimate a priori the probability density of the effect sizes for causal SNPs in a genomic region is inconsistent or unavailable, then specifying a prior variance for the effect sizes is challenging. We propose both an empirical method to estimate this prior variance, and a coherent approach to using SNP-level functional data to inform the prior probability of causal association. Through simulation we show that when ranking SNPs by our empirical Bayes factor in a fine-mapping study, the causal SNP rank is generally as high or higher than the rank using Bayes factors with other plausible values of the prior variance. Importantly, we also show that assigning SNP-specific prior probabilities of association based on expert prior functional knowledge of the disease mechanism can lead to improved causal SNP ranks compared to ranking with identical prior probabilities of association. We demonstrate our methods by applying them to the fine-mapping of the CASP8 region of chromosome 2 using genotype data from the Collaborative Oncological Gene-Environment Study (COGS) Consortium. The data we analysed included approximately 46,000 breast cancer cases and 43,000 healthy controls.
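A common concrete form of such a Bayes factor is the Wakefield-style approximation, in which the prior variance W of the effect size is exactly the quantity whose specification the abstract discusses. This is an illustrative sketch, not necessarily the paper's exact formulation.

```python
import numpy as np

def approx_bayes_factor(beta_hat, se, W):
    """Approximate Bayes factor in favour of association.

    Assumes the estimate beta_hat ~ N(beta, V) with V = se**2, and a
    prior beta ~ N(0, W) under the alternative; the BF is the ratio of
    the marginal likelihoods N(beta_hat; 0, V + W) / N(beta_hat; 0, V).
    """
    V = float(se) ** 2
    z2 = (float(beta_hat) / float(se)) ** 2
    return np.sqrt(V / (V + W)) * np.exp(z2 * W / (2.0 * (V + W)))
```

Ranking SNPs by this quantity depends on the chosen W: a strongly associated SNP (say beta_hat = 0.1, se = 0.02) yields a very large BF, while a null estimate yields a BF below 1, and the paper's empirical method targets the choice of W itself.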

Journal ArticleDOI
TL;DR: Cox proportional hazard models using functional regression (FR) to perform gene‐based association analysis of survival traits while adjusting for covariates and likelihood ratio test (LRT) statistics to test for associations between the survival traits and multiple genetic variants in a genetic region are developed.
Abstract: Genetic studies of survival outcomes have been proposed and conducted recently, but statistical methods for identifying genetic variants that affect disease progression are rarely developed. Motivated by our ongoing real studies, here we develop Cox proportional hazard models using functional regression (FR) to perform gene-based association analysis of survival traits while adjusting for covariates. The proposed Cox models are fixed effect models in which the genetic effects of multiple genetic variants are assumed to be fixed. We introduce likelihood ratio test (LRT) statistics to test for associations between the survival traits and multiple genetic variants in a genetic region. Extensive simulation studies demonstrate that the proposed Cox FR LRT statistics have well-controlled type I error rates. To evaluate power, we compare the Cox FR LRT with the previously developed burden test (BT) in a Cox model and the sequence kernel association test (SKAT), which is based on mixed effect Cox models. The Cox FR LRT statistics have higher power than, or similar power to, the Cox SKAT LRT, except when 50%/50% of causal variants have negative/positive effects and all causal variants are rare. In addition, the Cox FR LRT statistics have higher power than the Cox BT LRT. The models and related test statistics can be useful in whole-genome and whole-exome association studies. An age-related macular degeneration dataset was analyzed as an example.

Journal ArticleDOI
TL;DR: The interaction‐term genomic inflation factor (lambda) showed inflation and deflation that varied with sample size and allele frequency; that similar lambda variation occurred in the absence of population substructure; and that lambda was strongly related to heteroskedasticity but not to minor non‐normality of phenotypes.
Abstract: Adequate control of type I error rates will be necessary in the increasing genome-wide search for interactive effects on complex traits. After observing unexpected variability in type I error rates from SNP-by-genome interaction scans, we sought to characterize this variability and test the ability of heteroskedasticity-consistent standard errors to correct it. We performed 81 SNP-by-genome interaction scans using a product-term model on quantitative traits in a sample of 1,053 unrelated European Americans from the NHLBI Family Heart Study, and additional scans on five simulated datasets. We found that the interaction-term genomic inflation factor (lambda) showed inflation and deflation that varied with sample size and allele frequency; that similar lambda variation occurred in the absence of population substructure; and that lambda was strongly related to heteroskedasticity but not to minor non-normality of phenotypes. Heteroskedasticity-consistent standard errors narrowed the range of lambda, with HC3 outperforming HC0, but in individual scans they tended to create new P-value outliers related to sparse two-locus genotype classes. We explain the lambda variation as a result of non-independence of test statistics coupled with stochastic biases in test statistics due to a failure of the test to reach asymptotic properties. We propose that one way to interpret lambda is by comparison to an empirical distribution generated from data simulated under the null hypothesis and without population substructure. We further conclude that the interaction-term lambda should not be used to adjust test statistics and that heteroskedasticity-consistent standard errors come with limitations that may outweigh their benefits in this setting.
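The genomic inflation factor discussed above is conventionally computed as the median observed 1-df chi-square test statistic divided by the median of the null chi-square(1) distribution. A minimal sketch, using simulated null P-values rather than the NHLBI data:

```python
import numpy as np
from scipy import stats

def genomic_inflation(pvals):
    """Genomic control lambda: median of the 1-df chi-square statistics
    implied by the P-values, divided by the chi-square(1) median
    (approximately 0.4549). Values near 1 indicate no inflation."""
    chi2 = stats.chi2.isf(np.asarray(pvals, float), df=1)
    return np.median(chi2) / stats.chi2.ppf(0.5, df=1)

# Under the null, P-values are uniform and lambda should be near 1.
rng = np.random.default_rng(1)
lam = genomic_inflation(rng.uniform(size=100_000))
```

The paper's point is that for interaction terms this summary can drift from 1 even without substructure, so deviations should be judged against an empirical null distribution rather than "corrected" by dividing test statistics by lambda.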

Journal ArticleDOI
TL;DR: The approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel and provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies.
Abstract: Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusions only at the SNP-set level and does not directly indicate which SNP(s) in an identified set actually drive the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP-set analysis and biological functional studies. Both simulation studies and a real data application are used to demonstrate the proposed approach.
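The two kernels named above can be computed directly for small genotype matrices. This is a short sketch with a hypothetical `ibs_kernel` helper, not code from KNIFE or SKAT.

```python
import numpy as np

def ibs_kernel(G):
    """Identity-by-State kernel for an n x m genotype dosage matrix
    (entries 0/1/2): K[i, j] is the fraction of alleles shared in state,
    averaged over the m SNPs. K is symmetric with ones on the diagonal."""
    diff = np.abs(G[:, None, :] - G[None, :, :])  # pairwise |g_i - g_j|
    return (2.0 - diff).sum(axis=2) / (2.0 * G.shape[1])

def linear_kernel(G):
    """Linear kernel: inner products of genotype vectors (G @ G.T)."""
    return G @ G.T
```

For instance, two subjects with genotypes (0, 1, 2) and (2, 1, 0) share only the middle SNP fully, giving an IBS similarity of 1/3, while identical genotype vectors give 1.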

Journal ArticleDOI
TL;DR: The utility of whole genome sequence and innovative analyses for identifying candidate regions influencing complex phenotypes are demonstrated, including 16 carnitine‐related metabolites that are important components of mammalian energy metabolism.
Abstract: We use whole genome sequence data and rare variant analysis methods to investigate a subset of the human serum metabolome, including 16 carnitine-related metabolites that are important components of mammalian energy metabolism. Medium-pass sequence data consisting of 12,820,347 rare variants and serum metabolomics data were available on 1,456 individuals. By applying a penalization method, we identified two genes, FGF8 and MDGA2, with significant effects on lysine and cis-4-decenoylcarnitine, respectively, using Δ-AIC and likelihood ratio test statistics. Single variant analyses in these regions did not identify a single low-frequency variant (minor allele count > 3) responsible for the underlying signal. The results demonstrate the utility of whole genome sequence data and innovative analyses for identifying candidate regions influencing complex phenotypes.

Journal ArticleDOI
TL;DR: Multiethnic studies had greater power than single‐ethnicity studies at many loci, with inclusion of African Americans providing the largest impact, and association studies between rare variants and complex disease should consider including subjects from multiple ethnicities.
Abstract: Several methods have been proposed to increase power in rare variant association testing by aggregating information from individual rare variants (MAF < 0.005). However, how to best combine rare variants across multiple ethnicities and the relative performance of designs using different ethnic sampling fractions remains unknown. In this study, we compare the performance of several statistical approaches for assessing rare variant associations across multiple ethnicities. We also explore how different ethnic sampling fractions perform, including single-ethnicity studies and studies that sample up to four ethnicities. We conducted simulations based on targeted sequencing data from 4,611 women in four ethnicities (African, European, Japanese American, and Latina). As with single-ethnicity studies, burden tests had greater power when all causal rare variants were deleterious, and variance component-based tests had greater power when some causal rare variants were deleterious and some were protective. Multiethnic studies had greater power than single-ethnicity studies at many loci, with inclusion of African Americans providing the largest impact. On average, studies including African Americans had as much as 20% greater power than equivalently sized studies without African Americans. This suggests that association studies between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.
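The burden idea compared above can be sketched as a regression of the trait on each subject's rare-allele count; the simulated genotypes below are hypothetical, not the study's sequencing data, and the variance-component (SKAT-style) alternative is not shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, m = 2000, 20

# Hypothetical rare-variant genotype matrix: m variants with MAF in
# (0.001, 0.005), all causal and deleterious with per-allele effect 0.5.
maf = rng.uniform(0.001, 0.005, m)
G = rng.binomial(2, maf, size=(n, m)).astype(float)
y = 0.5 * G.sum(axis=1) + rng.normal(0.0, 1.0, n)

def burden_test(G, y):
    """Simple burden test: collapse the variants into a per-subject
    rare-allele count and return the two-sided P-value for its slope."""
    burden = G.sum(axis=1)
    result = stats.linregress(burden, y)
    return result.pvalue

p = burden_test(G, y)
```

Collapsing is powerful in exactly the scenario simulated here (all causal variants deleterious); when effects go in both directions, the positive and negative contributions cancel in the count, which is why variance-component tests win in that setting.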

Journal ArticleDOI
TL;DR: Flexibility and competitive power of the functional linear model approach is demonstrated by contrasting its performance with commonly used statistical tools and its potential for discovery and characterization of genetic architecture of complex traits using sequencing data from the Dallas Heart Study is illustrated.
Abstract: Recent technological advances equipped researchers with capabilities that go beyond traditional genotyping of loci known to be polymorphic in a general population. Genetic sequences of study participants can now be assessed directly. This capability removed technology-driven bias toward scoring predominantly common polymorphisms and let researchers reveal a wealth of rare and sample-specific variants. Although the relative contributions of rare and common polymorphisms to trait variation are being debated, researchers are faced with the need for new statistical tools for simultaneous evaluation of all variants within a region. Several research groups demonstrated flexibility and good statistical power of the functional linear model approach. In this work we extend previous developments to allow inclusion of multiple traits and adjustment for additional covariates. Our functional approach is unique in that it provides a nuanced depiction of effects and interactions for the variables in the model by representing them as curves varying over a genetic region. We demonstrate flexibility and competitive power of our approach by contrasting its performance with commonly used statistical tools and illustrate its potential for discovery and characterization of genetic architecture of complex traits using sequencing data from the Dallas Heart Study.

Journal ArticleDOI
TL;DR: The Framingham Heart Study data is used to demonstrate the promising performance of the new methods as well as inconsistent results produced by the standard MR analysis that relies on a single measurement of the exposure at some arbitrary time point.
Abstract: A Mendelian randomization (MR) analysis is performed to analyze the causal effect of an exposure variable on a disease outcome in observational studies, by using genetic variants that affect the disease outcome only through the exposure variable. This method has recently gained popularity among epidemiologists given the success of genetic association studies. Many exposure variables of interest in epidemiological studies are time varying, for example, body mass index (BMI). Although longitudinal data have been collected in many cohort studies, current MR studies only use one measurement of a time-varying exposure variable, which cannot adequately capture the long-term time-varying information. We propose using the functional principal component analysis method to recover the underlying individual trajectory of the time-varying exposure from the sparsely and irregularly observed longitudinal data, and then conduct MR analysis using the recovered curves. We further propose two MR analysis methods. The first assumes a cumulative effect of the time-varying exposure variable on the disease risk, while the second assumes a time-varying genetic effect and employs functional regression models. We focus on statistical testing for a causal effect. Our simulation studies mimicking the real data show that the proposed functional data analysis-based methods incorporating longitudinal data have substantial power gains compared to standard MR analysis using only one measurement. We used the Framingham Heart Study data to demonstrate the promising performance of the new methods as well as inconsistent results produced by the standard MR analysis that relies on a single measurement of the exposure at some arbitrary time point.
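The "single measurement" analysis that the paper improves upon is the standard single-instrument Wald ratio. A hedged sketch with simulated data (true causal effect 0.25, unmeasured confounder U; not the Framingham data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: G instruments the exposure X; U confounds X and Y;
# the true causal effect of X on Y is 0.25.
G = rng.binomial(2, 0.3, n).astype(float)
U = rng.normal(size=n)
X = 0.5 * G + U + rng.normal(size=n)
Y = 0.25 * X + U + rng.normal(size=n)

def slope(a, b):
    """OLS slope of b regressed on a."""
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

# Confounded observational estimate vs. the single-instrument Wald
# ratio (G-Y association divided by G-X association).
naive = slope(X, Y)
wald = slope(G, Y) / slope(G, X)
```

The naive slope is badly biased upward by U, while the Wald ratio recovers approximately 0.25; the paper's functional methods extend this idea by replacing the single exposure measurement with a trajectory recovered via functional principal component analysis.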

Journal ArticleDOI
TL;DR: In this paper, nine rare missense variants at evolutionarily conserved sites in TCIRG1 were associated with lower absolute neutrophil count (ANC; p = 0.005).
Abstract: Neutrophils are a key component of innate immunity. Individuals with low neutrophil count are susceptible to frequent infections. Linkage and association between congenital neutropenia and a single rare missense variant in TCIRG1 have been reported in a single family. Here, we report on nine rare missense variants at evolutionarily conserved sites in TCIRG1 that are associated with lower absolute neutrophil count (ANC; p = 0.005) in 1,058 participants from three cohorts: Atherosclerosis Risk in Communities (ARIC), Coronary Artery Risk Development in Young Adults (CARDIA), and Jackson Heart Study (JHS) of the NHLBI Grand Opportunity Exome Sequencing Project (GO ESP). These results validate the effects of TCIRG1 coding variation on ANC and suggest that this gene may be associated with a spectrum of mild to severe effects on ANC.