Showing papers by "Gonçalo R. Abecasis" published in 2006


Journal Article
TL;DR: It is demonstrated that the alternative strategy of jointly analyzing the data from both stages almost always results in increased power to detect genetic association, despite the need to use more stringent significance levels, even when effect sizes differ between the two stages.
Abstract: Genome-wide association is a promising approach to identify common genetic variants that predispose to human disease. Because of the high cost of genotyping hundreds of thousands of markers on thousands of subjects, genome-wide association studies often follow a staged design in which a proportion (πsamples) of the available samples are genotyped on a large number of markers in stage 1, and a proportion (πmarkers) of these markers are later followed up by genotyping them on the remaining samples in stage 2. The standard strategy for analyzing such two-stage data is to view stage 2 as a replication study and focus on findings that reach statistical significance when stage 2 data are considered alone. We demonstrate that the alternative strategy of jointly analyzing the data from both stages almost always results in increased power to detect genetic association, despite the need to use more stringent significance levels, even when effect sizes differ between the two stages. We recommend joint analysis for all two-stage genome-wide association studies, especially when a relatively large proportion of the samples are genotyped in stage 1 (πsamples ≥ 0.30), and a relatively large proportion of markers are selected for follow-up in stage 2 (πmarkers ≥ 0.01).

1,283 citations
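
To make the idea of joint analysis in the two-stage abstract above concrete, here is a minimal sketch. It assumes (this is not stated in the abstract) that each stage yields a z statistic for a marker and that the two are combined with weights equal to the square roots of the sample fractions, a common way of pooling independent z scores. The function names `joint_z` and `two_sided_p` are hypothetical, and the sketch does not compute the more stringent significance thresholds the abstract mentions.

```python
import math


def joint_z(pi_samples: float, z1: float, z2: float) -> float:
    """Combine stage 1 and stage 2 z statistics for one marker.

    Assumption: weights are the square roots of the sample fractions,
    so the combined statistic is standard normal under the null when
    z1 and z2 are independent standard normals.
    """
    if not 0.0 < pi_samples < 1.0:
        raise ValueError("pi_samples must lie strictly between 0 and 1")
    return math.sqrt(pi_samples) * z1 + math.sqrt(1.0 - pi_samples) * z2


def two_sided_p(z: float) -> float:
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # 30% of samples in stage 1, moderate evidence in each stage.
    z = joint_z(0.30, 2.0, 3.0)
    print(f"joint z = {z:.3f}, two-sided p = {two_sided_p(z):.2e}")
```

Under this weighting, moderate evidence from both stages can yield a combined statistic larger than either stage alone, which is the intuition behind treating stage 2 as more than a stand-alone replication.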


Journal Article
TL;DR: Results strongly suggest that HLA-Cw6 is the PSORS1 risk allele that confers susceptibility to early-onset psoriasis.
Abstract: Previous studies have narrowed the interval containing PSORS1, the psoriasis-susceptibility locus in the major histocompatibility complex (MHC), to an ∼300-kb region containing HLA-C and at least 10 other genes. In an effort to identify the PSORS1 gene, we cloned and completely sequenced this region from both chromosomes of five individuals. Two of the sequenced haplotypes were associated with psoriasis (risk), and the other eight were clearly unassociated (nonrisk). Comparison of sequence of the two risk haplotypes identified a 298-kb region of homology, extending from just telomeric of HLA-B to the HCG22 gene, which was flanked by clearly nonhomologous regions. Similar haplotypes cloned from unrelated individuals had nearly identical sequence. Combinatorial analysis of exonic variations in the known genes of the candidate interval revealed that HCG27, PSORS1C3, OTF3, TCF19, HCR, STG, and HCG22 bore no alleles unique to risk haplotypes among the 10 sequenced haplotypes. SPR1 and SEEK1 both had messenger RNA alleles specific to risk haplotypes, but only HLA-C and CDSN yielded protein alleles unique to risk. The risk alleles of HLA-C and CDSN (HLA-Cw6 and CDSN*TTC) were genotyped in 678 families with early-onset psoriasis; 620 of these families were also typed for 34 microsatellite markers spanning the PSORS1 interval. Recombinant haplotypes retaining HLA-Cw6 but lacking CDSN*TTC were significantly associated with psoriasis, whereas recombinants retaining CDSN*TTC but lacking HLA-Cw6 were not associated, despite good statistical power. By grouping recombinants with similar breakpoints, the most telomeric quarter of the 298-kb candidate interval could be excluded with high confidence. These results strongly suggest that HLA-Cw6 is the PSORS1 risk allele that confers susceptibility to early-onset psoriasis.

553 citations


Journal Article
TL;DR: It is suggested that there are multiple disease susceptibility alleles in the region and that noncoding CFH variants play a role in disease susceptibility.
Abstract: In developed countries, age-related macular degeneration is a common cause of blindness in the elderly. A common polymorphism, encoding the sequence variation Y402H in complement factor H (CFH), has been strongly associated with disease susceptibility. Here, we examined 84 polymorphisms in and around CFH in 726 affected individuals (including 544 unrelated individuals) and 268 unrelated controls. In this sample, 20 of these polymorphisms showed stronger association with disease susceptibility than the Y402H variant. Further, no single polymorphism could account for the contribution of the CFH locus to disease susceptibility. Instead, multiple polymorphisms defined a set of four common haplotypes (of which two were associated with disease susceptibility and two seemed to be protective) and multiple rare haplotypes (associated with increased susceptibility in aggregate). Our results suggest that there are multiple disease susceptibility alleles in the region and that noncoding CFH variants play a role in disease susceptibility.

360 citations


Journal Article
TL;DR: A comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, concludes that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets.
Abstract: Knowledge of haplotype phase is valuable for many analysis methods in the study of disease, population, and evolutionary genetics. Considerable research effort has been devoted to the development of statistical and computational methods that infer haplotype phase from genotype data. Although a substantial number of such methods have been developed, they have focused principally on inference from unrelated individuals, and comparisons between methods have been rather limited. Here, we describe the extension of five leading algorithms for phase inference for handling father-mother-child trios. We performed a comprehensive assessment of the methods applied to both trios and to unrelated individuals, with a focus on genomic-scale problems, using both simulated data and data from the HapMap project. The most accurate algorithm was PHASE (v2.1). For this method, the percentages of genotypes whose phase was incorrectly inferred were 0.12%, 0.05%, and 0.16% for trios from simulated data, HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, and HapMap Yoruban trios, respectively, and 5.2% and 5.9% for unrelated individuals in simulated data and the HapMap CEPH data, respectively. The other methods considered in this work had comparable but slightly worse error rates. The error rates for trios are similar to the levels of genotyping error and missing data expected. We thus conclude that all the methods considered will provide highly accurate estimates of haplotypes when applied to trio data sets. Running times differ substantially between methods. Although it is one of the slowest methods, PHASE (v2.1) was used to infer haplotypes for the 1 million–SNP HapMap data set. Finally, we evaluated methods of estimating the value of r² between a pair of SNPs and concluded that all methods estimated r² well when the estimated value was ≥0.8.

322 citations
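
The r² measure between a pair of SNPs mentioned in the abstract above can be illustrated with a short sketch. It uses the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB, computed here from already phased haplotypes. The function names and the 0/1 allele coding are assumptions made for illustration; this is not the estimation procedure of any of the phasing programs compared in the paper.

```python
def pairwise_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic SNPs.

    p_a, p_b: frequencies of the coded allele at SNP 1 and SNP 2.
    p_ab: frequency of the haplotype carrying both coded alleles.
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom <= 0.0:
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def r2_from_haplotypes(haplotypes):
    """Compute r^2 from phased haplotypes.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    alleles coded 0/1, one tuple per phased chromosome.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    return pairwise_r2(p_ab, p_a, p_b)


if __name__ == "__main__":
    perfectly_linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
    unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
    print(r2_from_haplotypes(perfectly_linked))
    print(r2_from_haplotypes(unlinked))
```

Because the calculation needs the two-locus haplotype frequency, errors in inferred phase feed directly into the estimate, which is why the abstract evaluates r² estimation alongside phasing accuracy.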


Journal Article
TL;DR: The genotype inference method is illustrated by inferring over 53 million SNP genotypes for 78 children in the Centre d'Etude du Polymorphisme Humain families, showing its utility in obtaining high-density genotypes in different family structures.
Abstract: Our genotype inference method combines sparse marker data from a linkage scan and high-resolution SNP genotypes for several individuals to infer genotypes for related individuals. We illustrate the method's utility by inferring over 53 million SNP genotypes for 78 children in the Centre d'Etude du Polymorphisme Humain families. The method can be used to obtain high-density genotypes in different family structures, including nuclear families commonly used in complex disease gene mapping studies.

144 citations


Journal Article
TL;DR: The likelihood-based method of Li et al., which assesses whether there is linkage disequilibrium between a disease locus and a SNP, is extended to accommodate sibships of arbitrary size and disease-phenotype configuration; the results suggest that when the disease is influenced by a single gene, the one sibling per ASP-control design is the most efficient.
Abstract: Linkage mapping of complex diseases is often followed by association studies between phenotypes and marker genotypes through use of case-control or family-based designs. Given fixed genotyping resources, it is important to know which study designs are the most efficient. To address this problem, we extended the likelihood-based method of Li et al., which assesses whether there is linkage disequilibrium between a disease locus and a SNP, to accommodate sibships of arbitrary size and disease-phenotype configuration. A key advantage of our method is the ability to combine data from different family structures. We consider scenarios for which genotypes are available for unrelated cases, affected sib pairs (ASPs), or only one sibling per ASP. We construct designs that use cases only and others that use unaffected siblings or unrelated unaffected individuals as controls. Different combinations of cases and controls result in seven study designs. We compare the efficiency of these designs when the number of individuals to be genotyped is fixed. Our results suggest that (1) when the disease is influenced by a single gene, the one sibling per ASP–control design is the most efficient, followed by the ASP-control design, and familial cases contribute more association information than singleton cases; (2) when the disease is influenced by multiple genes, familial cases provide more association information than singleton cases, unless the effect of the locus being tested is much smaller than at least one other untested disease locus; and (3) the case-control design can be useful for detecting genes with small effect in the presence of genes with much larger effect. Our findings will be helpful for researchers designing and analyzing complex disease-association studies and will facilitate genotyping resource allocation.

111 citations


Journal Article
TL;DR: An improved algorithm for tagSNP selection using the pairwise r² criterion is devised, which first breaks down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm.
Abstract: Motivation: Selecting SNP markers for genome-wide association studies is an important and challenging task. The goal is to minimize the number of markers selected for genotyping in a particular platform and therefore reduce genotyping cost while simultaneously maximizing the information content provided by selected markers. Results: We devised an improved algorithm for tagSNP selection using the pairwise r² criterion. We first break down large marker sets into disjoint pieces, where more exhaustive searches can replace the greedy algorithm for tagSNP selection. These exhaustive searches lead to smaller tagSNP sets being generated. In addition, our method evaluates multiple solutions that are equivalent according to the linkage disequilibrium criteria to accommodate additional constraints. Its performance was assessed using HapMap data. Availability: A computer program named FESTA has been developed based on this algorithm. The program is freely available and can be downloaded at http://www.sph.umich.edu/csg/qin/FESTA/ Contact: qin@umich.edu Supplementary information: http://www.sph.umich.edu/csg/qin/FESTA/

65 citations
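
The strategy described in the abstract above (split markers into disjoint pieces, then search small pieces exhaustively instead of greedily) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the FESTA implementation: "disjoint pieces" are taken to be connected components of the graph that joins marker pairs with r² at or above the threshold, and all function names and the `max_exhaustive` cutoff are hypothetical.

```python
from itertools import combinations


def ld_components(r2, threshold):
    """Connected components of the graph joining markers with r2 >= threshold."""
    n = len(r2)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and r2[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(comp))
    return components


def covers(tag, marker, r2, threshold):
    """A tag covers itself and every marker in strong LD with it."""
    return tag == marker or r2[tag][marker] >= threshold


def smallest_tag_set(comp, r2, threshold):
    """Exhaustive search: try tag sets of increasing size until one covers comp."""
    for size in range(1, len(comp) + 1):
        for tags in combinations(comp, size):
            if all(any(covers(t, m, r2, threshold) for t in tags) for m in comp):
                return list(tags)
    return list(comp)


def greedy_tag_set(comp, r2, threshold):
    """Greedy fallback: repeatedly pick the marker covering most untagged markers."""
    untagged, tags = set(comp), []
    while untagged:
        best = max(comp, key=lambda t: sum(covers(t, m, r2, threshold) for m in untagged))
        tags.append(best)
        untagged -= {m for m in untagged if covers(best, m, r2, threshold)}
    return tags


def select_tags(r2, threshold=0.8, max_exhaustive=15):
    """Pick tags per component, exhaustively when the component is small enough."""
    tags = []
    for comp in ld_components(r2, threshold):
        if len(comp) <= max_exhaustive:
            tags.extend(smallest_tag_set(comp, r2, threshold))
        else:
            tags.extend(greedy_tag_set(comp, r2, threshold))
    return sorted(tags)
```

Since no marker in one component can tag a marker in another at the chosen threshold, solving each component separately loses nothing, and the exhaustive step is only affordable because the pieces are small.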


Journal Article
01 Aug 2006-Genetics
TL;DR: A modified VC method is developed and implemented that directly models the nonnormal distribution using Gaussian copulas and yields unbiased parameter estimates, correct type I error rates, and improved power for testing linkage with a variety of nonnormal traits as compared with the standard VC and the regression-based methods.
Abstract: Mapping and identifying variants that influence quantitative traits is an important problem for genetic studies. Traditional QTL mapping relies on a variance-components (VC) approach with the key assumption that the trait values in a family follow a multivariate normal distribution. Violation of this assumption can lead to inflated type I error, reduced power, and biased parameter estimates. To accommodate nonnormally distributed data, we developed and implemented a modified VC method, which we call the “copula VC method,” that directly models the nonnormal distribution using Gaussian copulas. The copula VC method allows the analysis of continuous, discrete, and censored trait data, and the standard VC method is a special case when the data are distributed as multivariate normal. Through the use of link functions, the copula VC method can easily incorporate covariates. We use computer simulations to show that the proposed method yields unbiased parameter estimates, correct type I error rates, and improved power for testing linkage with a variety of nonnormal traits as compared with the standard VC and the regression-based methods.

38 citations


Journal Article
TL;DR: Although their analytical calculations are for a fully informative marker locus, in the settings the authors examined power was similar to what could be attained with a single nucleotide polymorphism (SNP) mapping panel (with >1 SNP/cM).
Abstract: Variance component linkage analysis is commonly used to map quantitative trait loci (QTLs) in general pedigrees. Large pedigrees are especially attractive for these studies because they provide greater power per genotyped individual than small pedigrees. We propose accurate and computationally efficient methods to calculate the analytical power of variance component linkage analysis that can accommodate large pedigrees. Our analytical power computation involves the approximation of the noncentrality parameter for the likelihood-ratio test by its Taylor expansions. We develop efficient algorithms to compute the second and third moments of the identical by descent (IBD) sharing distribution and enable rapid computation of the Taylor expansions. Our algorithms take advantage of natural symmetries in pedigrees and can accurately analyze many large pedigrees in a few seconds. We verify the accuracy of our power calculation via simulation in pedigrees with 2-5 generations and 2-8 siblings per sibship. We apply this proposed analytical power calculation to 98 quantitative traits in a cohort study of 6,148 Sardinians in which the largest pedigree includes 625 phenotyped individuals. Simulations based on eight representative traits show that the difference between our analytical estimation of the expected LOD score and the average of simulated LOD scores is less than 0.05 (1.5%). Although our analytical calculations are for a fully informative marker locus, in the settings we examined power was similar to what could be attained with a single nucleotide polymorphism (SNP) mapping panel (with >1 SNP/cM). Our algorithms for power analysis together with polygenic analysis are implemented in a freely available computer program, POLY.

36 citations


Journal Article
TL;DR: The results suggest that, when possible, sex-specific maps should be used in multipoint linkage analysis of affected sibling pairs when identity-by-descent states are incompletely known due to missing parental genotypes and incomplete marker heterozygosity.
Abstract: The ratio of male and female genetic map distances varies dramatically across the human genome. Despite these sex differences in genetic map distances, most multipoint linkage analyses use sex-averaged genetic maps. We investigated the impact of using a sex-averaged genetic map instead of sex-specific maps for multipoint linkage analysis of affected sibling pairs when identity-by-descent states are incompletely known due to missing parental genotypes and incomplete marker heterozygosity. If either all or no parental genotypes were available, for intermarker distances of 10, 5, and 1 cM, we found no important differences in the expected maximum lod score (EMLOD) or location estimates of the disease locus between analyses that used the sex-averaged map and those that used the true sex-specific maps for female:male genetic map distance ratios 1:10 and 10:1. However, when genotypes for only one parent were available and the recombination rate was higher in females, the EMLOD using the sex-averaged map was inflated compared to the sex-specific map analysis if only mothers were genotyped and deflated if only fathers were genotyped. The inflation of the lod score when only mothers were genotyped led to markedly increased false-positive rates in some cases. The opposite was true when the recombination rate was higher in males; the EMLOD was inflated if only fathers were genotyped, and deflated if only mothers were genotyped. While the effects of missing parental genotypes were mitigated for less extreme cases of missingness, our results suggest that when possible, sex-specific maps should be used in linkage analyses. Genet. Epidemiol. 30:384–396, 2006. © 2006 Wiley-Liss, Inc.

22 citations


Journal Article
TL;DR: In Table 1 of the versions of this article initially published online and in print, the significance thresholds for C2 were incorrect, and the significance thresholds for Cjoint in the case of πsamples = 0.20 were incorrect.
Abstract: Nat Genet 38, 209–213 (2006). In Table 1 of the versions of this article initially published online and in print, the significance thresholds for C2 were incorrect, and the significance thresholds for Cjoint in the case of πsamples = 0.20 were incorrect. The error has been corrected in the HTML and PDF versions of the article.

Journal Article
TL;DR: This study uses simulations to explore properties of the replicate pool p‐value estimator p̂RP and shows that it provides an excellent approximation to the traditional gene‐dropping estimator for significantly less computational effort.
Abstract: The calculation of empirical p-values for genome-wide non-parametric linkage tests continues to present significant computational challenges for many complex disease mapping studies. The gold standard approach is to use gene dropping to simulate null genome scans. Unfortunately, this approach is too computationally expensive for many data sets of interest. An alternative, more efficient method for sampling null genome scans is to pre-calculate pools of family-specific statistics and then resample from these replicate pools to generate “pseudo-replicate” genome scans. In this study, we use simulations to explore properties of the replicate pool p-value estimator p̂RP and show that it provides an excellent approximation to the traditional gene-dropping estimator for significantly less computational effort. While the computational efficiency of the replicate pool estimator is noticeable in almost all data sets, by applying the replicate pool method to several previously characterized data sets we show that savings in computational effort can be especially significant (on the order of 10,000-fold or more) when one or more large families are analyzed. We also estimate replicate pool p-values for the schizophrenia data described by Abecasis et al. and show that p̂RP closely approximates gene-drop p-values for all linkage peaks reported for this study. Lastly, we expand upon Song et al.’s previous work by deriving a conservative estimator of the variance for p̂RP that can easily be computed in practical settings. We have implemented the replicate pool method along with our variance estimator in a new program called Pseudo, which is the first widely available automated implementation of the replicate pool method. Genet. Epidemiol. 30:320–332, 2006. © 2006 Wiley-Liss, Inc.
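
The resampling idea in the abstract above can be sketched in a few lines. The sketch assumes each family already has a pool of null replicates (each replicate being a list of per-position statistics, for example from gene dropping within that family alone) and that family statistics are combined by an equally weighted sum scaled by the square root of the number of families. The combination rule, the function names, and the data layout are assumptions for illustration, not the Pseudo program's actual interface, and the variance estimator from the paper is not reproduced here.

```python
import random


def pseudo_replicate_max(pools, rng):
    """Build one pseudo-replicate genome scan and return its maximum.

    pools: one pool per family; each pool is a list of null replicates,
    and each replicate is a list of statistics, one per map position.
    """
    draws = [rng.choice(pool) for pool in pools]  # one replicate per family
    n_positions = len(draws[0])
    scale = len(pools) ** 0.5  # assumed equal-weight combination
    return max(sum(d[k] for d in draws) / scale for k in range(n_positions))


def replicate_pool_p_value(pools, observed_max, n_pseudo=10000, seed=0):
    """Empirical genome-wide p-value: share of pseudo-replicates whose
    maximum statistic is at least as large as the observed maximum."""
    rng = random.Random(seed)
    hits = sum(
        pseudo_replicate_max(pools, rng) >= observed_max for _ in range(n_pseudo)
    )
    return hits / n_pseudo


if __name__ == "__main__":
    # Toy pools: two families, two null replicates each, three positions.
    pools = [
        [[0.2, -0.5, 1.1], [-0.3, 0.4, 0.0]],
        [[1.0, 0.1, -0.7], [-1.2, 0.6, 0.3]],
    ]
    print(replicate_pool_p_value(pools, observed_max=1.0, n_pseudo=1000))
```

The saving comes from reuse: with F families and R replicates per pool, up to R^F distinct pseudo-replicate scans can be assembled from only F × R family-level simulations.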