
Showing papers in "Genetic Epidemiology in 2013"


Journal ArticleDOI
TL;DR: It is concluded that Mendelian randomization investigations using summarized data from uncorrelated variants are similarly efficient to those using individual‐level data, although the necessary assumptions cannot be so fully assessed.
Abstract: Genome-wide association studies, which typically report regression coefficients summarizing the associations of many genetic variants with various traits, are potentially a powerful source of data for Mendelian randomization investigations. We demonstrate how such coefficients from multiple variants can be combined in a Mendelian randomization analysis to estimate the causal effect of a risk factor on an outcome. The bias and efficiency of estimates based on summarized data are compared to those based on individual-level data in simulation studies. We investigate the impact of gene–gene interactions, linkage disequilibrium, and ‘weak instruments’ on these estimates. Both an inverse-variance weighted average of variant-specific associations and a likelihood-based approach for summarized data give similar estimates and precision to the two-stage least squares method for individual-level data, even when there are gene–gene interactions. However, these summarized data methods overstate precision when variants are in linkage disequilibrium. If the P-value in a linear regression of the risk factor on each variant is sufficiently small, then weak instrument bias will be small. We use these methods to estimate the causal effect of low-density lipoprotein cholesterol (LDL-C) on coronary artery disease using published data on five genetic variants. A 30% reduction in LDL-C is estimated to reduce coronary artery disease risk by 67% (95% CI: 54% to 76%). We conclude that Mendelian randomization investigations using summarized data from uncorrelated variants are similarly efficient to those using individual-level data, although the necessary assumptions cannot be so fully assessed.
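The inverse-variance weighted combination of variant-specific associations described in this abstract can be sketched in a few lines; the function below is an illustrative reimplementation for intuition, not the authors' code.

```python
import numpy as np

def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted (IVW) Mendelian randomization estimate.

    beta_x : per-variant associations with the risk factor
    beta_y : per-variant associations with the outcome
    se_y   : standard errors of beta_y
    """
    beta_x, beta_y, se_y = map(np.asarray, (beta_x, beta_y, se_y))
    ratio = beta_y / beta_x            # variant-specific Wald ratio estimates
    w = beta_x**2 / se_y**2            # inverse-variance weight of each ratio
    estimate = np.sum(w * ratio) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))      # valid only for uncorrelated variants
    return estimate, se
```

As the abstract cautions, this simple weighting assumes the variants are uncorrelated; variants in linkage disequilibrium make the standard error anti-conservative.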

2,003 citations


Journal ArticleDOI
TL;DR: The sequence kernel association test (SKAT) is extended to be applicable to family data and has higher power than competing methods in many different scenarios, and is illustrated using glycemic traits from the Framingham Heart Study.
Abstract: A large number of rare genetic variants have been discovered as sequencing technology has developed and sequencing costs have fallen. Rare variant analysis may help identify novel genes associated with diseases and quantitative traits, adding to our ability to explain the heritability of these phenotypes. Many statistical methods for rare variant analysis have been developed in recent years, but some of them require the strong assumption that all rare variants in the analysis share the same direction of effect, and others require permutation to calculate P-values and are computationally intensive. Among these methods, the sequence kernel association test (SKAT) is a powerful method under many different scenarios. It does not require any assumption on the directionality of effects, and statistical significance is computed analytically. In this paper, we extend SKAT to be applicable to family data. The family-based SKAT (famSKAT) has a different test statistic and null distribution compared to SKAT, but is equivalent to SKAT when there is no familial correlation. Our simulation studies show that SKAT has inflated type I error if familial correlation is inappropriately ignored, but has appropriate type I error if applied to a single individual per family to obtain an unrelated subset. In contrast, famSKAT has the correct type I error when analyzing correlated observations, and it has higher power than competing methods in many different scenarios. We illustrate our approach by analyzing the association of rare genetic variants with glycemic traits from the Framingham Heart Study.
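For intuition, the unrelated-sample SKAT statistic that this work extends to families can be sketched as a quadratic form in the phenotype residuals. The toy version below omits covariates and uses a permutation P-value in place of the analytic mixture-of-chi-squares distribution; names and defaults are illustrative.

```python
import numpy as np

def skat_q(y, G, weights=None):
    """Variance-component score statistic Q = s' W s with s = G' r, where r
    are the phenotype residuals about the mean (no covariates, for brevity)."""
    y = np.asarray(y, float)
    G = np.asarray(G, float)
    if weights is None:
        weights = np.ones(G.shape[1])
    r = y - y.mean()
    s = G.T @ r                       # per-variant score contributions
    return float(s @ (weights * s))   # nonnegative by construction

def skat_perm_pvalue(y, G, n_perm=999, seed=1):
    """Permutation P-value for Q; exchangeability holds only for unrelated
    individuals, which is exactly the assumption famSKAT relaxes."""
    rng = np.random.default_rng(seed)
    q_obs = skat_q(y, G)
    exceed = sum(skat_q(rng.permutation(y), G) >= q_obs for _ in range(n_perm))
    return (1 + exceed) / (n_perm + 1)
```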

214 citations


Journal ArticleDOI
TL;DR: This paper uses analytic calculation and simulation to compare the empirical type I error rate and power of four logistic regression based tests: Wald, score, likelihood ratio, and Firth bias‐corrected, and establishes MAC as the key parameter determining test calibration for joint and meta‐analysis.
Abstract: In genome-wide association studies of binary traits, investigators typically use logistic regression to test common variants for disease association within studies, and combine association results across studies using meta-analysis. For common variants, logistic regression tests are well calibrated, and meta-analysis of study-specific association results is only slightly less powerful than joint analysis of the combined individual-level data. In recent sequencing and dense chip based association studies, investigators increasingly test low-frequency variants for disease association. In this paper, we seek to (1) identify the association test with maximal power among tests with well controlled type I error rate and (2) compare the relative power of joint and meta-analysis tests. We use analytic calculation and simulation to compare the empirical type I error rate and power of four logistic regression based tests: Wald, score, likelihood ratio, and Firth bias-corrected. We demonstrate for low-count variants (roughly minor allele count [MAC] < 400) that: (1) for joint analysis, the Firth test has the best combination of type I error and power; (2) for meta-analysis of balanced studies (equal numbers of cases and controls), the score test is best, but is less powerful than Firth test based joint analysis; and (3) for meta-analysis of sufficiently unbalanced studies, all four tests can be anti-conservative, particularly the score test. We also establish MAC as the key parameter determining test calibration for joint and meta-analysis.
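The Firth bias-corrected test singled out here penalizes the logistic likelihood with the Jeffreys prior, which keeps estimates finite even under the complete separation that plagues low-count variants. A minimal Newton-iteration sketch, assuming the standard penalized-score form (not any particular package's implementation):

```python
import numpy as np

def firth_logistic(X, y, n_iter=50, tol=1e-8):
    """Firth bias-corrected logistic regression.

    The Jeffreys-prior penalty modifies the score to X'(y - p + h(1/2 - p)),
    where h are the leverages of the weighted design; unlike ordinary
    maximum likelihood, the estimates stay finite under separation."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        info_inv = np.linalg.inv(X.T @ (W[:, None] * X))   # inverse Fisher information
        h = W * np.einsum('ij,jk,ik->i', X, info_inv, X)   # leverages
        step = info_inv @ (X.T @ (y - p + h * (0.5 - p)))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

On a completely separated two-group table this reproduces the familiar "add 1/2 to each cell" estimate, where ordinary logistic regression would diverge.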

146 citations


Journal ArticleDOI
TL;DR: It is demonstrated that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adopted as rare Mendelian disease‐gene finders with a simple ranking‐by‐statistical‐significance protocol, and the performance compares very favorably to state‐of‐art filtering approaches.
Abstract: The need for improved algorithmic support for variant prioritization and disease-gene identification in personal genome data is widely acknowledged. We previously presented the Variant Annotation, Analysis, and Search Tool (VAAST), which employs an aggregative variant association test that combines both amino acid substitution (AAS) and allele frequencies. Here we describe and benchmark VAAST 2.0, which uses a novel conservation-controlled AAS matrix (CASM) to incorporate information about phylogenetic conservation. We show that the CASM approach improves VAAST's variant prioritization accuracy compared to its previous implementation, and compared to SIFT, PolyPhen-2, and MutationTaster. We also show that VAAST 2.0 outperforms KBAC, WSS, SKAT, and variable threshold (VT) using published case-control datasets for Crohn disease (NOD2), hypertriglyceridemia (LPL), and breast cancer (CHEK2). VAAST 2.0 also improves search accuracy on simulated datasets across a wide range of allele frequencies, population-attributable disease risks, and allelic heterogeneity, factors that compromise the accuracies of other aggregative variant association tests. We also demonstrate that, although most aggregative variant association tests are designed for common genetic diseases, these tests can be easily adopted as rare Mendelian disease-gene finders with a simple ranking-by-statistical-significance protocol, and their performance compares very favorably to state-of-the-art filtering approaches. The latter, despite their popularity, have suboptimal performance, especially as case sample sizes increase.

142 citations


Journal ArticleDOI
TL;DR: This work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison, suggesting that the novel piecewise IBS method yields consistently higher imputation quality than other methods/software.
Abstract: Imputation in admixed populations is an important problem but challenging due to the complex linkage disequilibrium (LD) pattern. The emergence of large reference panels such as that from the 1,000 Genomes Project enables more accurate imputation in general, and in particular for admixed populations and for uncommon variants. To efficiently benefit from these large reference panels, one key issue to consider in a modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approaches. We evaluated the performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women's Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and genome regions of varying levels of LD. We also include BEAGLE and IMPUTE2 for comparison. Experimental results with the large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than other methods/software. The advantage is particularly noteworthy among uncommon variants, where we observe up to 5.1% information gain, with the difference being highly significant (Wilcoxon signed rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison.
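The core of the IBS-based category is easy to state: score each reference haplotype by allele sharing with the target and keep the closest ones. The sketch below is a whole-haplotype version for illustration; the piecewise method favored by the paper repeats the selection segment by segment along the chromosome.

```python
import numpy as np

def select_reference_by_ibs(target, panel, k):
    """Return indices of the k reference haplotypes sharing the most
    alleles (identity-by-state) with the target haplotype.

    target : length-m array of 0/1 alleles
    panel  : (n_reference, m) array of reference haplotypes
    """
    shared = (np.asarray(panel) == np.asarray(target)).sum(axis=1)
    return np.argsort(-shared, kind="stable")[:k]
```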

135 citations


Journal ArticleDOI
TL;DR: A novel statistical method based on the optimal Sequence Kernel Association Test that allows for rare variant effects using continuous phenotypes in the analysis of extreme phenotype samples is proposed and the increase in power of this method is demonstrated through simulation of a wide range of scenarios.
Abstract: In the increasing number of sequencing studies aimed at identifying rare variants associated with complex traits, the power of the test can be improved by guided sampling procedures. We confirm both analytically and numerically that sampling individuals with extreme phenotypes can enrich the presence of causal rare variants and can therefore lead to an increase in power compared to random sampling. Although application of traditional rare variant association tests to these extreme phenotype samples requires dichotomizing the continuous phenotypes before analysis, the dichotomization procedure can decrease the power by reducing the information in the phenotypes. To avoid this, we propose a novel statistical method based on the optimal Sequence Kernel Association Test that allows us to test for rare variant effects using continuous phenotypes in the analysis of extreme phenotype samples. The increase in power of this method is demonstrated through simulation of a wide range of scenarios as well as in the triglyceride data of the Dallas Heart Study.

133 citations


Journal ArticleDOI
TL;DR: Analysis of simulated and real data using lasso and elastic‐net penalized support‐vector machine models, a mixed‐effects linear model, a polygenic score, and unpenalized logistic regression shows that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained.
Abstract: A central goal of medical genetics is to accurately predict complex disease from genotypes. Here, we present a comprehensive analysis of simulated and real data using lasso and elastic-net penalized support-vector machine models, a mixed-effects linear model, a polygenic score, and unpenalized logistic regression. In simulation, the sparse penalized models achieved lower false-positive rates and higher precision than the other methods for detecting causal SNPs. The common practice of prefiltering SNP lists for subsequent penalized modeling was examined and shown to substantially reduce the ability to recover the causal SNPs. Using genome-wide SNP profiles across eight complex diseases within cross-validation, lasso and elastic-net models achieved substantially better predictive ability in celiac disease, type 1 diabetes, and Crohn's disease, and had equivalent predictive ability in the rest, with the results in celiac disease strongly replicating between independent datasets. We investigated the effect of linkage disequilibrium on the predictive models, showing that the penalized methods leverage this information to their advantage, compared with methods that assume SNP independence. Our findings show that sparse penalized approaches are robust across different disease architectures, producing as good as or better phenotype predictions and variance explained. This has fundamental ramifications for the selection and future development of methods to genetically predict human disease.
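The sparsity that drives these results comes from the lasso's soft-thresholding update, which has a compact coordinate-descent form. The sketch below is a generic illustration (not the paper's solver) minimizing (1/2n)||y - Xb||^2 + lam*||b||_1:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.
    Minimizes (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, float).copy()
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # add back j's contribution
            z = X[:, j] @ resid / n             # partial correlation with residual
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta
```

With lam at or above max|X'y|/n every coefficient is thresholded to zero; lowering lam lets causal SNPs enter one by one, which is how the penalty produces the sparse SNP selections discussed above.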

133 citations


Journal ArticleDOI
TL;DR: This paper derives a set of two score statistics, testing the group effect by variant characteristics and the heterogeneity effect, and makes a novel modification to these score statistics so that they are independent under the null hypothesis and their asymptotic distributions can be derived.
Abstract: For rare-variant association analysis, due to extreme low frequencies of these variants, it is necessary to aggregate them by a prior set (e.g., genes and pathways) in order to achieve adequate power. In this paper, we consider hierarchical models to relate a set of rare variants to phenotype by modeling the effects of variants as a function of variant characteristics while allowing for variant-specific effect (heterogeneity). We derive a set of two score statistics, testing the group effect by variant characteristics and the heterogeneity effect. We make a novel modification to these score statistics so that they are independent under the null hypothesis and their asymptotic distributions can be derived. As a result, the computational burden is greatly reduced compared with permutation-based tests. Our approach provides a general testing framework for rare variant association, which includes many commonly used tests, such as the burden test [Li and Leal, 2008] and the sequence kernel association test [Wu et al., 2011], as special cases. Furthermore, in contrast to these tests, our proposed test has an added capacity to identify which components of variant characteristics and heterogeneity contribute to the association. Simulations under a wide range of scenarios show that the proposed test is valid, robust, and powerful. An application to the Dallas Heart Study illustrates that apart from identifying genes with significant associations, the new method also provides additional information regarding the source of the association. Such information may be useful for generating hypotheses in future studies.

119 citations


Journal ArticleDOI
TL;DR: A new two‐step screening and testing method (EDG×E) that is optimized to find genes with a weak marginal effect is proposed, and application of this method to a G × Sex scan for childhood asthma reveals two potentially interesting SNPs that were not identified in the marginal‐association scan.
Abstract: In a genome-wide association study (GWAS), investigators typically focus their primary analysis on the direct (marginal) associations of each single nucleotide polymorphism (SNP) with the trait. Some SNPs that are truly associated with the trait may not be identified in this scan if they have a weak marginal effect and thus low power to be detected. However, these SNPs may be quite important in subgroups of the population defined by an environmental or personal factor, and may be detectable if such a factor is carefully considered in a gene-environment (G × E) interaction analysis. We address the question "Using a genome wide interaction scan (GWIS), can we find new genes that were not found in the primary GWAS scan?" We review commonly used approaches for conducting a GWIS in case-control studies, and propose a new two-step screening and testing method (EDG×E) that is optimized to find genes with a weak marginal effect. We simulate several scenarios in which our two-step method provides 70-80% power to detect a disease locus while a marginal scan provides less than 5% power. We also provide simulations demonstrating that the EDG×E method outperforms other GWIS approaches (including case only and previously proposed two-step methods) for finding genes with a weak marginal effect. Application of this method to a G × Sex scan for childhood asthma reveals two potentially interesting SNPs that were not identified in the marginal-association scan. We distribute a new software program (G×Escan, available at http://biostats.usc.edu/software) that implements this new method as well as several other GWIS approaches.
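The shared skeleton of these two-step procedures is simple once the screening and interaction P-values are in hand: the step-2 multiple-testing correction is paid only over the SNPs that survive step 1. A schematic version (function name and default thresholds are illustrative):

```python
def two_step_gwis(screen_p, interact_p, alpha1=1e-3, alpha=0.05):
    """Generic two-step screening-and-testing for GxE interaction.

    screen_p   : step-1 screening P-values, one per SNP (for EDGxE these
                 combine E-D and G-D evidence; here they are taken as given)
    interact_p : GxE interaction P-values, one per SNP
    Returns indices of SNPs significant after Bonferroni correction over
    only the SNPs that passed the screen.
    """
    passed = [i for i, p in enumerate(screen_p) if p < alpha1]
    if not passed:
        return []
    threshold = alpha / len(passed)
    return [i for i in passed if interact_p[i] < threshold]
```

Because the screen discards most SNPs, the step-2 threshold alpha/len(passed) is far less stringent than a genome-wide Bonferroni cut, which is where the power gain over a one-stage interaction scan comes from.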

110 citations


Journal ArticleDOI
TL;DR: A large genome‐wide association study among individuals of European ancestry in the extended cohorts for heart and aging research in genomic epidemiology (CHARGE) VTE consortium replicated key genetic associations in F5 and ABO, and confirmed the importance of F11 and FGG loci for VTE.
Abstract: Venous thromboembolism (VTE) is a common, heritable disease resulting in high rates of hospitalization and mortality. Yet few associations between VTE and genetic variants, all in the coagulation pathway, have been established. To identify additional genetic determinants of VTE, we conducted a two-stage genome-wide association study (GWAS) among individuals of European ancestry in the extended cohorts for heart and aging research in genomic epidemiology (CHARGE) VTE consortium. The discovery GWAS comprised 1,618 incident VTE cases out of 44,499 participants from six community-based studies. Genotypes for genome-wide single-nucleotide polymorphisms (SNPs) were imputed to approximately 2.5 million SNPs in HapMap and association with VTE assessed using study-design appropriate regression methods. Meta-analysis of these results identified two known loci, in F5 and ABO. The top 1,047 tag SNPs (P ≤ 0.0016) from the discovery GWAS were tested for association in an additional 3,231 cases and 3,536 controls from three case-control studies. In the combined data from these two stages, additional genome-wide significant associations were observed on 4q35 at F11 (top SNP rs4253399, intronic to F11) and on 4q28 at FGG (rs6536024, 9.7 kb from FGG; P < 5.0 × 10⁻¹³ for both). The associations at the FGG locus were not completely explained by previously reported variants. Loci at or near SUSD1 and OTUD7A showed borderline yet novel associations (P < 5.0 × 10⁻⁶) and constitute new candidate genes. In conclusion, this large GWAS replicated key genetic associations in F5 and ABO, and confirmed the importance of F11 and FGG loci for VTE. Future studies are warranted to better characterize the associations with F11 and FGG and to replicate the new candidate associations.

109 citations


Journal ArticleDOI
TL;DR: Fisher's method consistently outperforms the minimum‐p and the individual linear and quadratic tests, as well as the optimal sequence kernel association test, SKAT‐O, and is robust across models with varying proportions of causal, deleterious, and protective rare variants, allele frequencies, and effect sizes.
Abstract: This is the peer reviewed version of the following article: "Derkach, A., Lawless, J.F. and Sun, L. (2013). Robust and powerful tests for rare variants using Fisher's method to combine evidence of association from two or more complementary tests. Genetic Epidemiology, 37(1), 110–121", which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1002/gepi.21689/full (DOI: 10.1002/gepi.21689). This article may be used for non-commercial purposes in accordance with http://olabout.wiley.com/WileyCDA/Section/id-828039.html (Wiley Terms and Conditions for Self-Archiving).

Journal ArticleDOI
TL;DR: This paper focuses on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems, and develops an R package, ridge, which implements the automatic choice of ridge parameter presented in this paper.
Abstract: To date, numerous genetic variants have been identified as associated with diverse phenotypic traits. However, identified associations generally explain only a small proportion of trait heritability and the predictive power of models incorporating only known-associated variants has been small. Multiple regression is a popular framework in which to consider the joint effect of many genetic variants simultaneously. Ordinary multiple regression is seldom appropriate in the context of genetic data, due to the high dimensionality of the data and the correlation structure among the predictors. There has been a resurgence of interest in the use of penalised regression techniques to circumvent these difficulties. In this paper, we focus on ridge regression, a penalised regression approach that has been shown to offer good performance in multivariate prediction problems. One challenge in the application of ridge regression is the choice of the ridge parameter that controls the amount of shrinkage of the regression coefficients. We present a method to determine the ridge parameter based on the data, with the aim of good performance in high-dimensional prediction problems. We establish a theoretical justification for our approach, and demonstrate its performance on simulated genetic data and on a real data example. Fitting a ridge regression model to hundreds of thousands to millions of genetic variants simultaneously presents computational challenges. We have developed an R package, ridge, which addresses these issues. Ridge implements the automatic choice of ridge parameter presented in this paper, and is freely available from CRAN.
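Ridge regression itself has a closed form, and the effective degrees of freedom that typically drive data-based choices of the ridge parameter are one trace computation away. A minimal sketch (the package's actual lambda rule is more involved than shown here):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution beta = (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def effective_df(X, lam):
    """Effective degrees of freedom trace(X (X'X + lam*I)^{-1} X'),
    which falls from p toward 0 as the shrinkage parameter lam grows."""
    p = X.shape[1]
    return float(np.trace(X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)))
```

For the genome-scale fits described above, one would avoid forming X'X densely; this direct version only illustrates the estimator being tuned.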

Journal ArticleDOI
TL;DR: The National Cancer Institute sponsored a "Gene-Environment Think Tank" on January 10-11, 2012 to facilitate discussions on the state of the science, the goals of G × E interaction studies in cancer epidemiology, and opportunities for developing novel study designs and analysis tools.
Abstract: Cancer risk is determined by a complex interplay of genetic and environmental factors. Genome-wide association studies (GWAS) have identified hundreds of common (minor allele frequency [MAF] > 0.05) and less common (0.01 < MAF < 0.05) genetic variants associated with cancer. The marginal effects of most of these variants have been small (odds ratios: 1.1-1.4). There remain unanswered questions on how best to incorporate the joint effects of genes and environment, including gene-environment (G × E) interactions, into epidemiologic studies of cancer. To help address these questions, and to better inform research priorities and allocation of resources, the National Cancer Institute sponsored a "Gene-Environment Think Tank" on January 10-11, 2012. The objective of the Think Tank was to facilitate discussions on (1) the state of the science, (2) the goals of G × E interaction studies in cancer epidemiology, and (3) opportunities for developing novel study designs and analysis tools. This report summarizes the Think Tank discussion, with a focus on contemporary approaches to the analysis of G × E interactions. Selecting the appropriate methods requires first identifying the relevant scientific question and rationale, with an important distinction made between analyses aiming to characterize the joint effects of putative or established genetic and environmental factors and analyses aiming to discover novel risk factors or novel interaction effects. Other discussion items include measurement error, statistical power, significance, and replication. Additional designs, exposure assessments, and analytical approaches need to be considered as we move from the current small number of success stories to a fuller understanding of the interplay of genetic and environmental factors.

Journal ArticleDOI
TL;DR: In this paper, the authors developed broad classes of burden and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, their methods can be used to account for population-structured data.
Abstract: Searching for rare genetic variants associated with complex diseases can be facilitated by enriching for diseased carriers of rare variants by sampling cases from pedigrees enriched for disease, possibly with related or unrelated controls. This strategy, however, complicates analyses because of shared genetic ancestry, as well as linkage disequilibrium among genetic markers. To overcome these problems, we developed broad classes of "burden" statistics and kernel statistics, extending commonly used methods for unrelated case-control data to allow for known pedigree relationships, for autosomes and the X chromosome. Furthermore, by replacing pedigree-based genetic correlation matrices with estimates of genetic relationships based on large-scale genomic data, our methods can be used to account for population-structured data. By simulations, we show that the type I error rates of our developed methods are near the asymptotic nominal levels, allowing rapid computation of P-values. Our simulations also show that a linear weighted kernel statistic is generally more powerful than a weighted "burden" statistic. Because the proposed statistics are rapid to compute, they can be readily used for large-scale screening of the association of genomic sequence data with disease status.

Journal ArticleDOI
TL;DR: Practical strategies for KM testing when multiple candidate kernels are present based on constructing composite kernels and based on efficient perturbation procedures are proposed and demonstrated to lead to substantially improved power over poor choices of kernels and only modest differences in power vs. using the best candidate kernel.
Abstract: Joint testing for the cumulative effect of multiple single-nucleotide polymorphisms grouped on the basis of prior biological knowledge has become a popular and powerful strategy for the analysis of large-scale genetic association studies. The kernel machine (KM)-testing framework is a useful approach that has been proposed for testing associations between multiple genetic variants and many different types of complex traits by comparing pairwise similarity in phenotype between subjects to pairwise similarity in genotype, with similarity in genotype defined via a kernel function. An advantage of the KM framework is its flexibility: choosing different kernel functions allows for different assumptions concerning the underlying model and can allow for improved power. In practice, it is difficult to know which kernel to use a priori because this depends on the unknown underlying trait architecture and selecting the kernel which gives the lowest P-value can lead to inflated type I error. Therefore, we propose practical strategies for KM testing when multiple candidate kernels are present based on constructing composite kernels and based on efficient perturbation procedures. We demonstrate through simulations and real data applications that the procedures protect the type I error rate and can lead to substantially improved power over poor choices of kernels and only modest differences in power vs. using the best candidate kernel.
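The composite-kernel idea rests on a closure property: any nonnegative-weighted sum of valid (positive semi-definite) kernels is again a valid kernel, so candidate kernels can be blended rather than picked post hoc. A small sketch with two standard genotype kernels (illustrative, not the paper's code):

```python
import numpy as np

def linear_kernel(G):
    """Linear genotype kernel G G'."""
    return G @ G.T

def ibs_kernel(G):
    """Average identity-by-state sharing between 0/1/2 genotype vectors."""
    n, m = G.shape
    K = np.zeros((n, n))
    for j in range(m):
        K += 2.0 - np.abs(G[:, [j]] - G[:, j])   # per-locus IBS count in {0,1,2}
    return K / (2.0 * m)

def composite_kernel(kernels, weights):
    """Nonnegative-weighted sum of PSD kernels, itself PSD."""
    return sum(w * K for w, K in zip(weights, kernels))
```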

Journal ArticleDOI
TL;DR: This work proposes to exploit an integer linear programming optimisation approach to pedigree learning, which is adapted to find valid pedigrees by imposing appropriate constraints, and is guaranteed to return a maximum likelihood pedigree.
Abstract: Large population biobanks of unrelated individuals have been highly successful in detecting common genetic variants affecting diseases of public health concern. However, they lack the statistical power to detect more modest gene-gene and gene-environment interaction effects or the effects of rare variants for which related individuals are ideally required. In reality, most large population studies will undoubtedly contain sets of undeclared relatives, or pedigrees. Although a crude measure of relatedness might sometimes suffice, having a good estimate of the true pedigree would be much more informative if this could be obtained efficiently. Relatives are more likely to share longer haplotypes around disease susceptibility loci and are hence biologically more informative for rare variants than unrelated cases and controls. Distant relatives are arguably more useful for detecting variants with small effects because they are less likely to share masking environmental effects. Moreover, the identification of relatives enables appropriate adjustments of statistical analyses that typically assume unrelatedness. We propose to exploit an integer linear programming optimisation approach to pedigree learning, which is adapted to find valid pedigrees by imposing appropriate constraints. Our method is not restricted to small pedigrees and is guaranteed to return a maximum likelihood pedigree. With additional constraints, we can also search for multiple high-probability pedigrees and thus account for the inherent uncertainty in any particular pedigree reconstruction. The true pedigree is found very quickly by comparison with other methods when all individuals are observed. Extensions to more complex problems seem feasible.

Journal ArticleDOI
TL;DR: It is shown that the F‐distributed tests of the proposed fixed effect functional linear models have higher power than that of sequence kernel association test ( SKAT) and its optimal unified test (SKAT‐O) for three scenarios in most cases, which can be rare variants or common variants or the combination of the two.
Abstract: Functional linear models are developed in this paper for testing associations between quantitative traits and genetic variants, which can be rare variants or common variants or the combination of the two. By treating multiple genetic variants of an individual in a human population as a realization of a stochastic process, the genome of an individual in a chromosome region is a continuum of sequence data rather than discrete observations. The genome of an individual is viewed as a stochastic function that contains both linkage and linkage disequilibrium (LD) information of the genetic markers. By using techniques of functional data analysis, both fixed and mixed effect functional linear models are built to test the association between quantitative traits and genetic variants adjusting for covariates. After extensive simulation analysis, it is shown that the F-distributed tests of the proposed fixed effect functional linear models have higher power than the sequence kernel association test (SKAT) and its optimal unified test (SKAT-O) for three scenarios in most cases: (1) the causal variants are all rare, (2) the causal variants are both rare and common, and (3) the causal variants are common. The superior performance of the fixed effect functional linear models is most likely due to their optimal utilization of both genetic linkage and LD information of multiple genetic variants in a genome and similarity among different individuals, while SKAT and SKAT-O only model the similarities and pairwise LD but do not model linkage and higher order LD information sufficiently. In addition, the proposed fixed effect models generate accurate type I error rates in simulation studies. We also show that the functional kernel score tests of the proposed mixed effect functional linear models are preferable in candidate gene analysis and small sample problems. The methods are applied to analyze three biochemical traits in data from the Trinity Students Study.
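The functional treatment of a variant region can be mimicked with an ordinary basis expansion: project each individual's genotype profile onto a few smooth basis functions of genomic position, then F-test the resulting low-dimensional scores. The sketch uses a polynomial basis where the paper uses smoother functional bases such as B-splines, so it is illustrative only:

```python
import numpy as np

def functional_scores(G, positions, n_basis=4):
    """Project an (n x m) genotype matrix onto n_basis smooth functions of
    normalized position, giving n_basis functional scores per individual."""
    t = (positions - positions.min()) / (positions.max() - positions.min())
    B = np.vander(t, n_basis, increasing=True)   # columns 1, t, t^2, ...
    return G @ B / len(t)

def f_statistic(y, Z):
    """F-test of H0: no association, comparing y ~ 1 against y ~ 1 + Z."""
    n, k = Z.shape
    X = np.column_stack([np.ones(n), Z])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss1 = np.sum((y - X @ beta) ** 2)
    rss0 = np.sum((y - y.mean()) ** 2)
    return ((rss0 - rss1) / k) / (rss1 / (n - k - 1))
```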

Journal ArticleDOI
TL;DR: A novel method for inferring the local ancestry of admixed individuals from dense genome‐wide single nucleotide polymorphism data called MULTIMIX, which allows multiple source populations, models population linkage disequilibrium between markers and is applicable to datasets in which the sample and source populations are either phased or unphased.
Abstract: We describe a novel method for inferring the local ancestry of admixed individuals from dense genome-wide single nucleotide polymorphism data. The method, called MULTIMIX, allows multiple source populations, models population linkage disequilibrium between markers, and is applicable to datasets in which the sample and source populations are either phased or unphased. The model is based upon a hidden Markov model of switches in ancestry between consecutive windows of loci. We model the observed haplotypes within each window using a multivariate normal distribution with parameters estimated from the ancestral panels. We present three methods to fit the model: Markov chain Monte Carlo sampling, the Expectation Maximization algorithm, and a Classification Expectation Maximization algorithm. The performance of our method on individuals simulated to be admixed with European and West African ancestry shows it to be comparable to HAPMIX, the ancestry calls of the two methods agreeing at 99.26% of loci across the three parameter groups. In addition to being faster than HAPMIX, it performs well over a range of extents of admixture in a simulation involving three ancestral populations. In an analysis of real data, we estimate the contribution of European, West African, and Native American ancestry to each locus in the Mexican samples of HapMap, giving estimates of ancestral proportions that are consistent with those previously reported.
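The window-to-window ancestry switching that MULTIMIX models can be sketched with a generic Viterbi decoder over a small HMM. Here the per-window emission log-likelihoods are supplied directly as toy numbers; in MULTIMIX they would come from the window-wise multivariate normal fitted to each ancestral panel, and the switch probability used below is purely illustrative.

```python
import math

def viterbi_ancestry(log_emit, switch_prob):
    """Most likely ancestry state per window under a simple HMM in which
    ancestry switches between consecutive windows with probability
    switch_prob (split evenly among the other states). log_emit[w][k] is
    the log-likelihood of window w's haplotype under ancestry k."""
    n_states = len(log_emit[0])
    stay = math.log(1 - switch_prob)
    move = math.log(switch_prob / (n_states - 1))
    # score[k]: best log-probability of any path ending in state k.
    score = [e - math.log(n_states) for e in log_emit[0]]
    back = []
    for emit in log_emit[1:]:
        prev = score
        ptr, score = [], []
        for k in range(n_states):
            cand = [prev[j] + (stay if j == k else move)
                    for j in range(n_states)]
            best = max(range(n_states), key=cand.__getitem__)
            ptr.append(best)
            score.append(cand[best] + emit[k])
        back.append(ptr)
    # Trace back the best path.
    k = max(range(n_states), key=score.__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]

# Three windows, two ancestries; emissions favour state 0, 0, then 1.
path = viterbi_ancestry([[-1.0, -5.0], [-1.0, -4.0], [-6.0, -1.0]],
                        switch_prob=0.1)
print(path)  # → [0, 0, 1]
```

The low switch probability is what makes the decoder prefer long runs of constant ancestry, mirroring the block-like structure of recently admixed genomes.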

Journal ArticleDOI
TL;DR: The findings indicate that African ancestry can account for, at least in part, the association between asthma and its associated trait, tIgE levels.
Abstract: Characterization of genetic admixture of populations in the Americas and the Caribbean is of interest for anthropological, epidemiological, and historical reasons. Asthma has a higher prevalence and is more severe in populations with a high African component. Association of African ancestry with asthma has been demonstrated. We estimated admixture proportions of samples from six trihybrid populations of African descent and determined the relationship between African ancestry and asthma and total serum IgE levels (tIgE). We genotyped 237 ancestry informative markers in asthmatics and nonasthmatic controls from Barbados (190/277), Jamaica (177/529), Brazil (40/220), Colombia (508/625), African Americans from New York (207/171), and African Americans from Baltimore/Washington, D.C. (625/757). We estimated individual ancestries and evaluated genetic stratification using Structure and principal component analysis. Association of African ancestry with asthma and tIgE was evaluated by regression analysis. Mean ± SD African ancestry ranged from 0.76 ± 0.10 among Barbadians to 0.33 ± 0.13 in Colombians. The European component varied from 0.14 ± 0.05 among Jamaicans and Barbadians to 0.26 ± 0.08 among Colombians. African ancestry was associated with risk for asthma in Colombians (odds ratio (OR) = 4.5, P = 0.001), Brazilians (OR = 136.5, P = 0.003), and African Americans of New York (OR = 4.7, P = 0.040). African ancestry was also associated with higher tIgE levels among Colombians (β = 1.3, P = 0.04), Barbadians (β = 3.8, P = 0.03), and Brazilians (β = 1.6, P = 0.03). Our findings indicate that African ancestry can account for, at least in part, the association between asthma and its associated trait, tIgE levels.

Journal ArticleDOI
TL;DR: In this paper, the authors show that either avoiding variable selection and instead testing the most informative principal components or integrating over variable selection using Bayesian model averaging can help control type 1 error rates.
Abstract: Integration of data from genome-wide single nucleotide polymorphism (SNP) association studies of different traits should allow researchers to disentangle the genetics of potentially related traits within individually associated regions. Formal statistical colocalisation testing of individual regions requires selection of a set of SNPs summarising the association in a region. We show that the SNP selection method greatly affects type 1 error rates, with published studies having used methods expected to result in substantially inflated type 1 error rates. We show that either avoiding variable selection and instead testing the most informative principal components or integrating over variable selection using Bayesian model averaging can help control type 1 error rates. Application to data from Graves' disease and Hashimoto's thyroiditis reveals a common genetic signature across seven regions shared between the diseases, and indicates that in five of six regions associated with Graves' disease and not Hashimoto's thyroiditis, this more likely reflects genuine absence of association with the latter rather than lack of power. Our examination, by simulation, of the performance of colocalisation tests and associated software will foster more widespread adoption of formal colocalisation testing. Given the increasing availability of large expression and genetic association datasets from disease-relevant tissue and purified cell populations, coupled with identification of regulatory sequences by projects such as ENCODE, colocalisation analysis has the potential to reveal both shared genetic signatures of related traits and causal disease genes and tissues.

Journal ArticleDOI
TL;DR: This framework is an adaptation of the sequence kernel association test (SKAT) that allows us to control for family structure; simulations show that, regardless of the level of trait heritability, the approach has good control of type I error and good power.
Abstract: Recent progress in sequencing technologies makes it possible to identify rare and unique variants that may be associated with complex traits. However, the results of such efforts depend crucially on the use of efficient statistical methods and study designs. Although family-based designs might enrich a data set for familial rare disease variants, most existing rare variant association approaches assume independence of all individuals. We introduce here a framework for association testing of rare variants in family-based designs. This framework is an adaptation of the sequence kernel association test (SKAT) which allows us to control for family structure. Our adjusted SKAT (ASKAT) combines the SKAT approach and the factored spectrally transformed linear mixed models (FaST-LMM) algorithm to capture family effects, based on an LMM incorporating the realized proportion of the genome that is identical by descent between pairs of individuals and using restricted maximum likelihood methods for estimation. In simulation studies, we evaluated the type I error and power of the proposed method and showed that, regardless of the level of trait heritability, our approach has good control of type I error and good power. Since our approach uses FaST-LMM to calculate variance components for the proposed mixed model, ASKAT is reasonably fast and can analyze hundreds of thousands of markers. Data from the UK twins consortium are presented to illustrate the ASKAT methodology.
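The variance-component statistic at the heart of SKAT-style tests reduces to a weighted sum of squared per-variant score contributions, Q = Σⱼ wⱼ (Σᵢ rᵢ Gᵢⱼ)², where r are residuals from the null model. Below is a minimal sketch for unrelated samples with illustrative inputs; ASKAT's contribution is to obtain the residuals and the null distribution from a kinship-aware mixed model instead of an ordinary null regression.

```python
def skat_q(y, yhat, genotypes, weights):
    """SKAT-style score statistic Q = r' G W G' r, where r = y - yhat are
    residuals from the null model, G is the n x m genotype matrix, and W
    holds per-variant weights (e.g. beta-density MAF weights). Only the
    statistic is computed here; its null distribution is a mixture of
    chi-squares evaluated with, e.g., Davies' method."""
    n = len(y)
    r = [yi - mi for yi, mi in zip(y, yhat)]
    q = 0.0
    for j, w in enumerate(weights):
        # Score for variant j: weighted squared covariance with residuals.
        s = sum(r[i] * genotypes[i][j] for i in range(n))
        q += w * s * s
    return q

# Toy data: four individuals, two variants, flat null fit, unit weights.
q = skat_q([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5],
           [[2, 1], [0, 1], [2, 0], [0, 0]], [1.0, 1.0])
print(q)  # → 4.0
```

Because each variant contributes a squared score, the statistic accumulates signal from effects in either direction, which is why kernel tests are robust to mixed protective and deleterious variants within a set.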

Journal ArticleDOI
TL;DR: A novel application of graph theory that identifies the maximum set of unrelated samples in any dataset given a user‐defined threshold of relatedness as well as all networks of related samples.
Abstract: Many statistical analyses of genetic data rely on the assumption of independence among samples. Consequently, relatedness is either modeled in the analysis or samples are removed to “clean” the data of any pairwise relatedness above a tolerated threshold. Current methods do not maximize the number of unrelated individuals retained for further analysis, and this is a needless loss of resources. We report a novel application of graph theory that identifies the maximum set of unrelated samples in any dataset given a user-defined threshold of relatedness as well as all networks of related samples. We have implemented this method into an open source program called Pedigree Reconstruction and Identification of a Maximum Unrelated Set, PRIMUS. We show that PRIMUS outperforms the three existing methods, allowing researchers to retain up to 50% more unrelated samples. A unique strength of PRIMUS is its ability to weight the maximum clique selection using additional criteria (e.g. affected status and data missingness). PRIMUS is a permanent solution to identifying the maximum number of unrelated samples for a genetic analysis.
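Conceptually, the task PRIMUS solves is the maximum independent set of a relatedness graph: vertices are samples, and edges connect pairs whose estimated relatedness exceeds the threshold. The exhaustive sketch below is only feasible for tiny networks; PRIMUS itself uses far more scalable clique-based machinery, and the names and kinship values here are illustrative.

```python
from itertools import combinations

def max_unrelated_set(samples, kinship, threshold=0.1):
    """Exhaustively find the largest subset of samples in which no pair
    exceeds the relatedness threshold, i.e. the maximum independent set
    of the relatedness graph. Pairs absent from the kinship dict are
    treated as unrelated. Brute force: small networks only."""
    related = {frozenset(p) for p, k in kinship.items() if k > threshold}
    # Try subsets from largest to smallest; return the first valid one.
    for size in range(len(samples), 0, -1):
        for subset in combinations(samples, size):
            if not any(frozenset(p) in related
                       for p in combinations(subset, 2)):
                return set(subset)
    return set()

# Toy network: A-B are siblings, B-C are first cousins, D is unrelated.
kinship = {("A", "B"): 0.25, ("B", "C"): 0.0625, ("A", "C"): 0.0}
keep = max_unrelated_set(["A", "B", "C", "D"], kinship, threshold=0.05)
print(sorted(keep))  # → ['A', 'C', 'D']
```

Note how a greedy "remove one member of each related pair" rule could have dropped A or C unnecessarily; solving the independent-set problem exactly is what lets tools like PRIMUS retain more samples than pairwise pruning.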

Journal ArticleDOI
TL;DR: This paper proposes a generalized estimating equation (GEE)-based kernel association test, a variance component based testing method, to test for the association between a phenotype and multiple variants in an SNP set jointly using family samples, for both continuous and discrete traits.
Abstract: Family-based genetic association studies of related individuals provide opportunities to detect genetic variants that complement studies of unrelated individuals. Most statistical methods for family association studies for common variants are single marker based, testing one SNP at a time. In this paper, we consider testing the effect of an SNP set, e.g., SNPs in a gene, in family studies, for both continuous and discrete traits. Specifically, we propose a generalized estimating equation (GEE)-based kernel association test, a variance component based testing method, to test for the association between a phenotype and multiple variants in an SNP set jointly using family samples. The proposed approach allows for both continuous and discrete traits, where the correlation among family members is taken into account through the use of an empirical covariance estimator. We derive the theoretical distribution of the proposed statistic under the null and develop analytical methods to calculate the P-values. We also propose an efficient resampling method for correcting for small sample size bias in family studies. The proposed method allows for easily incorporating covariates and SNP-SNP interactions. Simulation studies show that the proposed method properly controls for type I error rates under both random and ascertained sampling schemes in family studies. We demonstrate through simulation studies that our approach has superior performance for association mapping compared to the single marker based minimum P-value GEE test for an SNP-set effect over a range of scenarios. We illustrate the application of the proposed method using data from the Cleveland Family GWAS Study.

Journal ArticleDOI
TL;DR: The absence of a significant polygenic effect in this relatively large sample suggests an oligogenetic architecture for 25(OH)D, a modifiable trait linked with a growing number of chronic diseases.
Abstract: The primary circulating form of vitamin D is 25-hydroxy vitamin D (25(OH)D), a modifiable trait linked with a growing number of chronic diseases. In addition to environmental determinants of 25(OH)D, including dietary sources and skin ultraviolet B (UVB) exposure, twin- and family-based studies suggest that genetics contribute substantially to vitamin D variability, with heritability estimates ranging from 43% to 80%. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) located in four gene regions associated with 25(OH)D. These SNPs collectively explain only a fraction of the heritability in 25(OH)D estimated by twin- and family-based studies. Using 25(OH)D concentrations and GWAS data on 5,575 subjects drawn from five cohorts, we hypothesized that genome-wide data, in the form of (1) a polygenic score comprised of hundreds or thousands of SNPs that do not individually reach GWAS significance, or (2) a linear mixed model for genome-wide complex trait analysis, would explain variance in measured circulating 25(OH)D beyond that explained by known genome-wide significant 25(OH)D-associated SNPs. GWAS-identified SNPs explained 5.2% of the variation in circulating 25(OH)D in these samples, and there was little evidence that additional markers significantly improved predictive ability. On average, a polygenic score comprised of GWAS-identified SNPs explained a larger proportion of variation in circulating 25(OH)D than scores comprised of thousands of SNPs that were, on average, nonsignificant. Employing a linear mixed model for genome-wide complex trait analysis explained little additional variability (range 0-22%). The absence of a significant polygenic effect in this relatively large sample suggests an oligogenetic architecture for 25(OH)D.
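The polygenic score referred to here is, at its core, a weighted allele count: for each SNP, multiply the individual's allele dosage by the GWAS effect estimate and sum. A minimal sketch with hypothetical numbers; real scoring pipelines also handle strand alignment, LD pruning or clumping, and missing genotypes.

```python
def polygenic_score(genotypes, weights):
    """Weighted sum of allele counts (0/1/2) using per-SNP effect sizes,
    e.g. GWAS regression coefficients. Only the scoring step is shown;
    which SNPs enter the sum (genome-wide significant only, or thousands
    of subthreshold markers) is exactly the comparison made above."""
    return sum(g * w for g, w in zip(genotypes, weights))

# Hypothetical individual and effect sizes for four SNPs.
genos = [0, 1, 2, 1]
betas = [0.12, -0.05, 0.30, 0.08]
score = polygenic_score(genos, betas)
print(round(score, 2))  # → 0.63
```

Varying the P-value threshold that admits SNPs into the score, then checking how much trait variance each version explains, is the standard way such polygenic analyses probe genetic architecture.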

Journal ArticleDOI
TL;DR: The results of this case-control study of Caucasian glioma cases and controls highlight that the deficiencies of the candidate-gene approach lie in selecting both appropriate genes and relevant SNPs within these genes.
Abstract: Genomewide association studies (GWAS) and candidate-gene studies have implicated single-nucleotide polymorphisms (SNPs) in at least 45 different genes as putative glioma risk factors. Attempts to validate these associations have yielded variable results and few genetic risk factors have been consistently replicated. We conducted a case-control study of Caucasian glioma cases and controls from the University of California San Francisco (810 cases, 512 controls) and the Mayo Clinic (852 cases, 789 controls) in an attempt to replicate previously reported genetic risk factors for glioma. Sixty SNPs selected from the literature (eight from GWAS and 52 from candidate-gene studies) were successfully genotyped on an Illumina custom genotyping panel. Eight SNPs in/near seven different genes (TERT, EGFR, CCDC26, CDKN2A, PHLDB1, RTEL1, TP53) were significantly associated with glioma risk in the combined dataset (P < 0.05). Although several confirmed associations are located near genes long known to be involved in gliomagenesis (e.g., EGFR, CDKN2A, TP53), these associations were first discovered by the GWAS approach and are in noncoding regions. These results highlight that the deficiencies of the candidate-gene approach lie in selecting both appropriate genes and relevant SNPs within these genes.

Journal ArticleDOI
TL;DR: Three loci associated with previously reported HF-related metabolites were identified, and a genetic risk score (GRS) created by summing the most significant risk alleles from each metabolite was associated with an 11% greater risk of HF per allele.
Abstract: Both the prevalence and incidence of heart failure (HF) are increasing, especially among African Americans, but no large-scale, genome-wide association study (GWAS) of HF-related metabolites has been reported. We sought to identify novel genetic variants that are associated with metabolites previously reported to relate to HF incidence. GWASs of three metabolites identified previously as risk factors for incident HF (pyroglutamine, dihydroxy docosatrienoic acid, and X-11787, being either hydroxy-leucine or hydroxy-isoleucine) were performed in 1,260 African Americans free of HF at the baseline examination of the Atherosclerosis Risk in Communities (ARIC) study. A significant association on chromosome 5q33 (rs10463316, MAF = 0.358, P-value = 1.92 × 10(-10)) was identified for pyroglutamine. One region on chromosome 2p13, containing a nonsynonymous substitution in N-acetyltransferase 8 (NAT8), was associated with X-11787 (rs13538, MAF = 0.481, P-value = 1.71 × 10(-23)). The smallest P-value for dihydroxy docosatrienoic acid was for rs4006531 on chromosome 8q24 (MAF = 0.400, P-value = 6.98 × 10(-7)). None of the above SNPs were individually associated with incident HF, but a genetic risk score (GRS) created by summing the most significant risk alleles from each metabolite was associated with an 11% greater risk of HF per allele. In summary, we identified three loci associated with previously reported HF-related metabolites. Further use of metabolomics technology will facilitate replication of these findings in independent samples.

Journal ArticleDOI
TL;DR: SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set and showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants.
Abstract: Identification of gene-environment interaction (G × E) is important in understanding the etiology of complex diseases. However, partially due to the lack of power, there have been very few replicated G × E findings compared to the success in marginal association studies. The existing G × E testing methods mainly focus on improving the power for individual markers. In this paper, we took a different strategy and proposed a set-based gene-environment interaction test (SBERIA), which can improve the power by reducing the multiple testing burden and aggregating signals within a set. The major challenge of signal aggregation within a set is how to tell signals from noise and how to determine the direction of the signals. SBERIA takes advantage of the established correlation screening for G × E to guide the aggregation of genotypes within a marker set. The correlation screening has been shown to be an efficient way of selecting potential G × E candidate SNPs in case-control studies for complex diseases. Importantly, the correlation screening in case-control combined samples is independent of the interaction test. With this desirable feature, SBERIA maintains the correct type I error level and can be easily implemented in a regular logistic regression setting. We showed that SBERIA had higher power than benchmark methods in various simulation scenarios, both for common and rare variants. We also applied SBERIA to real genome-wide association study (GWAS) data of 10,729 colorectal cancer cases and 13,328 controls and found evidence of interaction between the set of known colorectal cancer susceptibility loci and smoking.
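The aggregation step described here, screening SNPs by their correlation with the exposure and summing genotypes signed by the correlation's direction, can be sketched as follows. The threshold and data are illustrative assumptions; in SBERIA the screening statistic and cutoff are chosen so that the screen remains independent of the subsequent interaction test, and the aggregated score is then tested for interaction in a standard logistic model (not shown).

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def sberia_burden(genotypes, exposure, threshold=0.1):
    """SBERIA-style aggregation sketch: screen each SNP by its correlation
    with the environmental exposure in the combined sample, then sum the
    genotypes of the SNPs passing the screen, signed by the correlation's
    direction."""
    cols = list(zip(*genotypes))
    burden = [0.0] * len(genotypes)
    for col in cols:
        r = pearson(col, exposure)
        if abs(r) < threshold:      # screened out: likely noise
            continue
        sign = 1.0 if r > 0 else -1.0
        for i, g in enumerate(col):
            burden[i] += sign * g
    return burden

# Toy data: SNP 1 correlates positively with exposure, SNP 2 negatively,
# SNP 3 not at all (and is therefore dropped by the screen).
G = [[2, 0, 1], [1, 0, 0], [0, 2, 0], [0, 1, 1]]
exposure = [1, 1, 0, 0]
burden = sberia_burden(G, exposure)
print(burden)  # → [2.0, 1.0, -2.0, -1.0]
```

Collapsing a marker set to one signed burden is what reduces the multiple-testing load from one interaction test per SNP to a single test per set.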

Journal ArticleDOI
TL;DR: The application of the new methods to the meta‐analysis of five major cardiovascular cohort studies identifies a new locus (HSCB) that is pleiotropic for the four traits analyzed.
Abstract: Genetic association studies often collect data on multiple traits that are correlated. Discovery of genetic variants influencing multiple traits can lead to better understanding of the etiology of complex human diseases. Conventional univariate association tests may miss variants that have weak or moderate effects on individual traits. We propose several multivariate test statistics to complement univariate tests. Our framework covers both studies of unrelated individuals and family studies and allows any type/mixture of traits. We relate the marginal distributions of multivariate traits to genetic variants and covariates through generalized linear models without modeling the dependence among the traits or family members. We construct score-type statistics, which are computationally fast and numerically stable even in the presence of covariates and which can be combined efficiently across studies with different designs and arbitrary patterns of missing data. We compare the power of the test statistics both theoretically and empirically. We provide a strategy to determine genome-wide significance that properly accounts for the linkage disequilibrium (LD) of genetic variants. The application of the new methods to the meta-analysis of five major cardiovascular cohort studies identifies a new locus (HSCB) that is pleiotropic for the four traits analyzed.

Journal ArticleDOI
TL;DR: It is shown that, different from single‐SNP inference, genes with diverse composition of rare and common variants may suffer from population stratification to various extent, and caution needs to be exercised when using principal component adjustment.
Abstract: Accurate genetic association studies are crucial for the detection and the validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been expended over the past decade to detect and to adjust for it. We now have efficient solutions for population stratification adjustment for single-SNP (single-nucleotide polymorphism) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare variation studies and in particular gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment—genomic control and principal component analysis—when used on gene-based association tests. We show that, different from single-SNP inference, genes with diverse compositions of rare and common variants may suffer from population stratification to various extents. The inflation in gene-level statistics can be impacted by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution needs to be exercised when using principal component adjustment, because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
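For reference, the classical single-SNP genomic control that the authors argue should not be applied universally to gene-based statistics works as follows: estimate an inflation factor λ as the ratio of the median observed 1-df chi-square statistic to its theoretical null median, then deflate all statistics by λ. A minimal sketch with toy numbers:

```python
import statistics

def genomic_control_lambda(chisq_stats):
    """Genomic-control inflation factor: the median of the observed 1-df
    chi-square statistics divided by the theoretical null median
    (~0.4549). Values substantially above 1 suggest stratification."""
    NULL_MEDIAN = 0.4549364  # median of the chi-square(1) distribution
    return statistics.median(chisq_stats) / NULL_MEDIAN

def gc_adjust(chisq_stats, lam):
    """Divide each statistic by lambda (only deflate, never inflate)."""
    lam = max(lam, 1.0)
    return [s / lam for s in chisq_stats]

stats_ = [0.2, 0.5, 0.9, 1.4, 2.3]   # toy single-SNP statistics
lam = genomic_control_lambda(stats_)
print(round(lam, 2))  # → 1.98
```

The abstract's point is that a single λ like this cannot be correct for every gene-level statistic at once, because the inflation varies with each gene's mix of rare and common variants and with the test used.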

Journal ArticleDOI
TL;DR: It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well, and the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification.
Abstract: For unrelated samples, principal component (PC) analysis has been established as a simple and effective approach to adjusting for population stratification in association analysis of common variants (CVs, minor allele frequency (MAF) > 5%). However, it is less clear how it would perform in analysis of low-frequency variants (LFVs, MAF between 1% and 5%) or of rare variants (RVs, MAF < 1%). Furthermore, with next-generation sequencing data, it is unknown whether PCs should be constructed based on CVs, LFVs, or RVs. In this study, we used the 1000 Genomes Project sequence data to explore the construction of PCs and their use in association analysis of LFVs or RVs for unrelated samples. It is shown that a few top PCs based on either CVs or LFVs could separate two continental groups, European and African samples, but those based on only RVs performed less well. When applied to several association tests in simulated data with population stratification, using PCs based on either CVs or LFVs was effective in controlling type I error rates, while nonadjustment led to inflated type I error rates. Perhaps the most interesting observation is that, although the PCs based on LFVs could better separate the two continental groups than those based on CVs, the use of the former could lead to overadjustment in the sense of substantial power loss in the absence of population stratification; in contrast, we did not see any problem with the use of the PCs based on CVs in all our examples.
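Constructing PCs from a genotype matrix follows a standard recipe: standardize each SNP by its allele frequency, form the individual-by-individual genetic relationship matrix, and take its leading eigenvectors. The pure-Python power iteration below is a toy sketch for a handful of samples (real analyses use optimized eigensolvers, e.g. in EIGENSOFT/smartpca); the asymmetric starting vector simply avoids beginning orthogonal to the leading eigenvector.

```python
import math

def leading_pc(genotypes):
    """Top principal component of a genotype matrix (individuals x SNPs,
    allele counts 0/1/2), using the usual per-SNP standardization by
    sqrt(2p(1-p)) and power iteration on the genetic relationship matrix."""
    n, m = len(genotypes), len(genotypes[0])
    # Standardize each SNP column by its allele frequency p.
    std = []
    for col in zip(*genotypes):
        p = sum(col) / (2 * n)
        denom = math.sqrt(2 * p * (1 - p)) or 1.0  # monomorphic guard
        std.append([(g - 2 * p) / denom for g in col])
    X = list(zip(*std))                  # back to individuals x SNPs
    # Individual-by-individual relationship matrix, then power iteration.
    K = [[sum(a * b for a, b in zip(X[i], X[j])) / m for j in range(n)]
         for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)          # asymmetric start
    for _ in range(200):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Two toy "populations" differing in allele frequency at four SNPs.
G = [[2, 2, 1, 2], [2, 1, 2, 2], [0, 0, 1, 0], [0, 1, 0, 0]]
pc1 = leading_pc(G)
print([x > 0 for x in pc1])  # → [True, True, False, False]
```

The sign pattern of the leading PC splits the two groups, which is exactly the axis of ancestry that would be included as a covariate to adjust for stratification; the study's question is which variant frequency class (CVs, LFVs, or RVs) should supply the columns of the matrix.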