What are the main resources of the TOPMed program?

Members of the broader scientific community are using TOPMed resources through the WGS and phenotype data available on dbGaP, the BRAVO variant server and the imputation reference panel on the TOPMed imputation server.

How can the authors enhance the analysis of any genotyped samples?

In addition to enabling detailed analysis of TOPMed sequenced samples, TOPMed can enhance the analysis of any genotyped samples [72].

What are the main uses of TOPMed data?

In addition to these uses, the authors expect that TOPMed data will improve nearly all ongoing studies of common and rare disorders by providing both a deep catalogue of variation in healthy individuals and an imputation resource that enables array-based studies to achieve a completeness that was previously attainable only through direct sequencing.

How many rare variants can be recovered from TOPMed?

This means that 89% of the approximately 80,000 rare variants with MAF < 0.5% in an average genome of African ancestry can be recovered through genotype imputation using the TOPMed panel.

Which population groups have the greatest heterozygosity?

As expected, African American and Caribbean population groups have the greatest heterozygosity [7,47], followed by Hispanic/Latino, European American, Amish, East Asian and Samoan groups.

(Open Access) Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program (2019) | Daniel Taliun

Q: How many variants were detected in the first 53,831 TOPMed samples?

In the first 53,831 TOPMed samples, the authors detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome.

Q: What is the significance of the clustering of SNVs?

A hallmark of human genetic variation is that SNVs tend to cluster together throughout the genome [3,28].

Q: How many simulated singletons were less than 100 bp apart?

In coalescent simulations (see Methods), only 0.16% of the simulated singletons within an individual were less than 100 bp apart (Supplementary Figs. 19, 20).

UMass Chan Medical School

eScholarship@UMassChan

Open Access Publications by UMMS Authors

2021-02-10

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Daniel Taliun

University of Michigan

Et al.

Part of the Genomics Commons, and the Population Biology Commons

Repository Citation

Taliun D, McManus DD, Cupples LA, Laurie CC, Jaquish CE, Hernandez RD, O'Connor TD, Abecasis GR. (2021). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Open Access Publications by UMMS Authors. https://doi.org/10.1038/s41586-021-03205-y. Retrieved from https://escholarship.umassmed.edu/oapubs/4616

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

290 | Nature | Vol 590 | 11 February 2021

Article

Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.

Advancing DNA-sequencing technologies and decreasing costs are enabling researchers to explore human genetic variation at an unprecedented scale [2,3]. For these advances to improve our understanding of human health, they must be deployed in well-phenotyped human samples and used to build resources such as variation catalogues [3,4], control collections [5,6] and imputation reference panels [7–9]. Here we describe high-coverage whole-genome sequencing (WGS) analyses of the first 53,831 TOPMed samples (Box 1 and Extended Data Tables 1, 2); additional data are being made available as quality control, variant calling and dbGaP curation are completed (altogether more than 130,000 TOPMed samples are now available in dbGaP).

A key goal of the TOPMed programme is to understand risk factors for heart, lung, blood and sleep disorders by adding WGS and other ‘omics’ data to existing studies with deep phenotyping (Supplementary Information 1.1 and Supplementary Fig. 1). The programme currently consists of more than 80 participating studies, around 1,000 investigators and more than 30 working groups (https://www.nhlbiwgs.org/working-groups-public). TOPMed participants are ethnically and ancestrally diverse (Extended Data Fig. 1, Supplementary Information 1.1.4 and Supplementary Fig. 2). Through a combination of race and ethnicity information (from participant questionnaires and/or study inclusion criteria), we classified study participants into ‘population groups’, which varied in composition according to the goals of each analysis. In some analyses, these groups were further refined using genetic ancestry (see Methods and Supplementary Information for details).

Our study extends previous efforts by identifying and characterizing the rare variants that comprise the majority of human genomic variation [7,10–12]. Rare variants represent recent and potentially deleterious changes that can affect protein function, gene expression or other biologically important elements [11,13,14].

TOPMed WGS quality assessment

WGS of the TOPMed samples was performed over multiple studies, years and sequencing centres. To minimize batch effects, we standardized laboratory methods, mapped and processed sequence data centrally using a single pipeline, and performed variant calling and genotyping jointly across all samples (see Methods). We annotated each variant site with multiple sequence quality metrics and trained machine learning filters to identify and exclude inconsistencies that are revealed when the same individual was sequenced repeatedly. Available WGS data were processed periodically to produce genotype data ‘freezes’. The 53,831 samples described here are drawn from TOPMed freeze 5.

https://doi.org/10.1038/s41586-021-03205-y

Received: 6 March 2019

Accepted: 7 January 2021

Published online: 10 February 2021

Open access

A list of authors and their affiliations appears at the end of the paper.

Stringent variant and sample quality filters were applied and the resulting genotype call sets were evaluated in several ways (Supplementary Information 1.2.2, 1.3, 1.4). First, we compared genotypes for samples sequenced in duplicate (the mean alternative allele concordance was 0.9995 for single-nucleotide variants (SNVs) and 0.9930 for insertions or deletions (indels)). Second, we compared genotypes to those from previous whole-exome sequencing datasets (protein-coding regions from GENCODE; 80% of variants were found with both approaches and overlapping variant calls had a concordance of 0.9993 for SNVs and 0.9974 for indels) (Supplementary Tables 1–3). Third, we compared genotypes to those obtained using alternative informatics tools (compared to GATK v.4.1.3, TOPMed has lower Mendelian inconsistency rates and minimizes batch effects) (Supplementary Table 4). These reproducibility estimates indicate the high quality of the genotype calls and effectiveness of machine-learning-based quality filters.
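
As an illustration of the duplicate comparison, the sketch below computes one plausible form of alternative allele concordance between two replicate call sets. The exact definition used for the reported values is given in the Supplementary Information and is not reproduced here; the definition in the code (agreement among sites where at least one replicate carries an alternative allele), the function name and the 0/1/2 genotype coding are assumptions made for this example.

```python
from typing import Optional, Sequence


def alt_allele_concordance(
    calls_a: Sequence[Optional[int]],
    calls_b: Sequence[Optional[int]],
) -> float:
    """Fraction of matching genotypes among informative sites.

    Genotypes are coded as alternative allele counts (0, 1 or 2); None marks a
    missing call. ASSUMPTION: a site is informative when both replicates are
    called and at least one of them carries an alternative allele.
    """
    considered = 0
    matching = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # skip sites with a missing call in either replicate
        if a == 0 and b == 0:
            continue  # homozygous reference in both: not informative
        considered += 1
        if a == b:
            matching += 1
    if considered == 0:
        raise ValueError("no informative sites to compare")
    return matching / considered


if __name__ == "__main__":
    replicate_1 = [0, 1, 2, 1, 0, None]
    replicate_2 = [0, 1, 2, 2, 1, 1]
    print(alt_allele_concordance(replicate_1, replicate_2))
```

Excluding sites that are homozygous reference in both replicates keeps the very large number of invariant positions from inflating the statistic.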

Batch effects were evaluated by (1) comparing distributions of genetic principal components among sequencing centres, which are very similar between European American and African American individuals (Supplementary Figs. 3–5); (2) comparing alternative allele concordance between duplicates among centres, which is high (the largest difference being 4 × 10⁻⁴), and the patterns of between- versus within-centre differences, which indicate random errors rather than systematic centre differences (Supplementary Figs. 6–8); and (3) performing tests of association between variants and batches, which show a very small fraction of variants with genome-wide significance (0.004%, Supplementary Figs. 9, 10) (Supplementary Information 1.2). We conclude that batch effects appear to be minor, thus enabling multi-study association testing.

410 million genetic variants in 53,831 samples

A total of 7.0 × 10¹⁵ bases of DNA-sequencing data were generated, consisting of an average of 129.6 × 10⁹ bases of sequence distributed across 864.2 million paired reads (each 100–151 base pairs (bp) long) per individual. For a typical individual, 99.65% of the bases in the reference genome were covered, to a mean read depth of 38.2×.
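
The sequencing totals can be cross-checked with simple arithmetic. The sketch below assumes a per-individual yield of 129.6 × 10⁹ bases, which is the order of magnitude consistent with 864.2 million reads of 100–151 bp each; the constant and function names are chosen for this example only.

```python
SAMPLES = 53_831
READS_PER_INDIVIDUAL = 864.2e6   # reads per individual, from the text
BASES_PER_INDIVIDUAL = 129.6e9   # ASSUMPTION: exponent inferred from the read count


def mean_read_length(bases: float, reads: float) -> float:
    """Average number of bases contributed by each read."""
    return bases / reads


def total_bases(bases_per_individual: float, samples: int) -> float:
    """Total sequence generated across all samples."""
    return bases_per_individual * samples


if __name__ == "__main__":
    length = mean_read_length(BASES_PER_INDIVIDUAL, READS_PER_INDIVIDUAL)
    total = total_bases(BASES_PER_INDIVIDUAL, SAMPLES)
    print(f"mean read length: {length:.1f} bp")
    print(f"total bases: {total:.2e}")
```

Under that assumption, the mean read length falls inside the stated 100–151 bp range and the study-wide total rounds to 7.0 × 10¹⁵ bases.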

Sequence analysis identified 410,323,831 genetic variants (381,343,078 SNVs and 28,980,753 indels), corresponding to an average of one variant per 7 bp (Extended Data Table 4). Overall, 78.7% of these variants had not been described in dbSNP build 149; TOPMed variants now account for the majority of variants in dbSNP. Among all variant alleles, 46.0% were singletons, observed once across all 53,831 participants. Among 40,722 unrelated participants (see Methods), the proportion of singleton variants was higher at 53.1% (Table 1). Downsampling analyses show that the proportion of singletons increases until around 15,000 unrelated individuals are sequenced and then decreases very gradually (Supplementary Fig. 11). The fraction of singletons in each region or class of sites closely tracks functional constraints. For example, among all 4,651,453 protein-coding variants in unrelated individuals, the proportion of singletons was the highest for the 104,704 frameshift variants (68.4%), high among the 97,217 putative splice and truncation variants (62.1%), intermediate among the 2,965,093 nonsynonymous variants (55.6%) and lowest among the 1,435,058 synonymous variants (49.8%). Beyond protein-coding sequences, we found increased proportions of singletons in promoters (55.0%), 5′ untranslated regions (54.7%), regions of open chromatin (53.4%) and 3′ untranslated regions (53.3%); we found lower proportions of singletons in intergenic regions (53.0%) (Supplementary Table 5). Although putative transcription factor binding sites initially appeared to show fewer singletons (52.7%) than the remainder of the genome (53.1%), this pattern did not hold when we analysed highly mutable CpG sites separately. In fact, transcription factor binding sites were enriched for singletons in both CpG sites and non-CpG sites, an example of Simpson’s paradox.
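
The reversal described here can be reproduced with a small numerical example. The counts below are hypothetical and chosen only to show how a class of sites can have a higher singleton fraction within both CpG and non-CpG strata yet a lower fraction overall when it is enriched for the stratum with fewer singletons; they are not TOPMed data.

```python
# HYPOTHETICAL counts, (singletons, variants) per stratum; not TOPMed data.
COUNTS = {
    "TFBS": {"CpG": (400, 1000), "non-CpG": (560, 1000)},
    "other": {"CpG": (38, 100), "non-CpG": (1045, 1900)},
}


def singleton_fraction(singletons: int, variants: int) -> float:
    """Proportion of variants that are singletons."""
    return singletons / variants


def pooled_fraction(strata: dict) -> float:
    """Singleton proportion after pooling all strata of one site class."""
    singletons = sum(s for s, _ in strata.values())
    variants = sum(v for _, v in strata.values())
    return singletons / variants


if __name__ == "__main__":
    for stratum in ("CpG", "non-CpG"):
        tfbs = singleton_fraction(*COUNTS["TFBS"][stratum])
        other = singleton_fraction(*COUNTS["other"][stratum])
        print(f"{stratum}: TFBS {tfbs:.3f} vs other {other:.3f}")
    print(f"pooled: TFBS {pooled_fraction(COUNTS['TFBS']):.3f} "
          f"vs other {pooled_fraction(COUNTS['other']):.3f}")
```

In this toy example half of the binding-site variants fall at CpG sites, compared with 5% elsewhere, which is what drives the pooled comparison in the opposite direction to the stratified ones.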

We identified an average of 3.78 million variants in each genome. Among these, an average of 30,207 (0.8%) were novel and 3,510 (0.1%) were singletons. Among all variants, we observed 3.17 million nonsynonymous and 1.53 million synonymous variants (a 2.1:1 ratio), but individual genomes contained similar numbers of nonsynonymous and synonymous variants (11,743 nonsynonymous and 11,768 synonymous, on average) (Extended Data Table 4). The difference can be explained if more than half of the nonsynonymous variants are removed from the population by natural selection before they become common.

Putative loss-of-function variants

A notable class of variants is the 228,966 putative loss-of-function (pLOF) variants that we observed in 18,493 (95.0%) GENCODE genes (Extended Data Table 5 and Supplementary Fig. 12). This class includes the highest proportion of singletons among all of the variant classes that we examined. An average individual carried 2.5 unique pLOF variants. We identified more pLOF variants per individual than in previous surveys based on exome sequencing, an increase that was mainly driven by the identification of additional frameshift variants (Supplementary Table 6) and by a more uniform and complete coverage of protein-coding regions (Supplementary Figs. 13, 14).

We searched for gene sets with fewer rare pLOF variants than expected based on gene size. The gene sets with strong functional constraint included genes that encode DNA- and RNA-binding proteins, spliceosomal complexes, translation initiation machinery and RNA splicing and processing proteins (Supplementary Table 7). Genes associated with human disease in COSMIC (31% depletion), the GWAS catalogue (around 8% depletion), OMIM (4% depletion) and ClinVar (4% depletion) all contained fewer rare pLOF variants than expected (each comparison P < 10⁻⁴).

Box 1

TOPMed participant consents and data access

The TOPMed programme comprises more than 80 participating studies, of which 32 are represented in the 53,831 whole genomes described here. TOPMed has leveraged existing studies with deep phenotyping and longitudinal follow-up data and with varied informed consent procedures and options. Consent groups range from broad ‘general research use’ and ‘health, medical and biomedical’ categories to disease-specific categories for heart, lung, blood and/or sleep disorders. Many studies have further consent modifiers, such as limiting use to not-for-profit organizations or requiring documentation of local IRB approval. Participant consents guide the appropriate use of data by TOPMed investigators as well; therefore, the set of study-consent groups used varies across different analyses reported in this paper (Extended Data Table 3).

TOPMed data have been deposited in dbGaP and access is adjudicated by a staff committee of the National Institutes of Health. The committee verifies that applications are consistent with data use limitations and consent groups for each sample. Study investigators have no role in the decision, except in a small subset of studies that require a letter of collaboration. A summary of currently available data and any use restrictions is available at https://www.ncbi.nlm.nih.gov/gap/advanced_search/?TERM=topmed.

Although TOPMed studies have separate dbGaP accessions, formats are standardized to facilitate combining data, with all variants from the joint genotype call set included in the variant call format (VCF) files, unique sample identifiers across all of TOPMed and sample attributes with TOPMed-specific variables. Notably, cross-study analyses require the identification of a set of compatible study-consent groups. In addition to genotype calls, CRAM files with aligned sequence reads are also available, hosted in commercial clouds and with access managed by dbGaP. The dbGaP accession numbers for all TOPMed studies referenced in this paper are listed in Extended Data Tables 2, 3.

The TOPMed imputation reference panel is available to users for imputation into their own samples via an imputation server. The server performs imputation into these samples, while the reference panel data themselves are not exposed to the user because they derive from multiple studies with variable consent types and other data use limitations (Extended Data Table 3).

The distribution of genetic variation

We examined the distribution of variant sites across the genome by counting variants across ordered 1-megabase (Mb) concatenations of contiguous sequence with a similar conservation level (indicated by combined annotation-dependent depletion (CADD) score), and in segments categorized by coding versus noncoding status (Fig. 1 and Extended Data Fig. 2). As expected, the vast majority of human genomic variation is rare (minor allele frequency (MAF) < 0.5%) [10,11] and located in putatively neutral, noncoding regions of the genome (Fig. 1). Although coding regions have lower average levels of both common (MAF ≥ 0.5%) and rare variation, we identified some ultra-conserved noncoding regions with even lower levels of genetic variation (Fig. 1 and Supplementary Fig. 15).
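
The rare and common categories used in this section follow from the 0.5% MAF threshold. A minimal sketch of that classification is given below; the function names are hypothetical, and the example assumes autosomal sites with every one of the 53,831 diploid individuals genotyped, so that each site has 107,662 alleles.

```python
RARE_MAF_THRESHOLD = 0.005  # 0.5%, the threshold stated in the text


def minor_allele_frequency(alt_allele_count: int, n_alleles: int) -> float:
    """Frequency of the less common of the two alleles at a biallelic site."""
    if n_alleles <= 0 or not 0 <= alt_allele_count <= n_alleles:
        raise ValueError("allele counts out of range")
    frequency = alt_allele_count / n_alleles
    return min(frequency, 1.0 - frequency)


def classify_variant(alt_allele_count: int, n_alleles: int) -> str:
    """Return 'rare' for MAF < 0.5% and 'common' otherwise."""
    maf = minor_allele_frequency(alt_allele_count, n_alleles)
    return "rare" if maf < RARE_MAF_THRESHOLD else "common"


def is_singleton(alt_allele_count: int) -> bool:
    """A singleton is an alternative allele observed exactly once."""
    return alt_allele_count == 1


if __name__ == "__main__":
    n_alleles = 2 * 53_831  # ASSUMPTION: all individuals genotyped at the site
    for count in (1, 538, 539):
        print(count, classify_variant(count, n_alleles), is_singleton(count))
```

With 107,662 alleles, the boundary falls between 538 and 539 copies of the minor allele.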

Segments with notably high or low levels of variation do exist. For example, one region on chromosome 8p (GRCh38 positions 1,000,001–7,000,000 bp) has the highest overall levels of variation (Extended Data Fig. 2). This is consistent with previous findings, as this region has been shown to have one of the highest mutation rates across the human genome.

Although levels of common and rare variation within segments are significantly correlated (R = 0.462, P ≤ 2 × 10⁻¹⁶) (Supplementary Fig. 16), there are outliers. For example, segments overlapping the major histocompatibility complex (MHC) have the highest levels of common variation but no notable increase in levels of rare variation, consistent with balancing selection [24–26]. A detailed examination of the MHC shows peaks of increased variation and nucleotide diversity consistent with assembly-based analyses of the region (Supplementary Fig. 17). Segments with a high proportion of coding bases feature a strong depletion in the number of common variants but only a modest depletion in rare variants (Supplementary Fig. 18).

Insights into mutation processes

A hallmark of human genetic variation is that SNVs tend to cluster together throughout the genome [3,28]. Such patterns of clustering contain important information about demographic history, signals of natural selection and processes that generate mutations. To dissect the spatial clustering of SNVs, we analysed a collection of 50,264,223 singleton SNVs ascertained in a subset of 3,000 unrelated individuals selected to have low levels of genetically estimated admixture: 1,000 each of African, East Asian and European ancestry (see Methods).

In these data, we observed that 1.9% of singletons in a given individual occur at distances of less than 100 bp apart [33,34] (Supplementary Figs. 19, 20). In coalescent simulations (see Methods), only 0.16% of the simulated singletons within an individual were less than 100 bp apart (Supplementary Figs. 19, 20). Although demographic history contributes to singleton clustering (Supplementary Information 1.6), population genetic processes alone do not fully account for the observed clustering patterns, particularly for the most closely spaced singletons.
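
The clustering statistic quoted above can be sketched as follows. The example assumes that distance is measured from each singleton to the next singleton on the same chromosome within one individual; the precise convention is described in the Methods and may differ, and the function names and toy positions are illustrative only.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def inter_singleton_distances(singletons: Iterable[Tuple[str, int]]) -> List[int]:
    """Distances (bp) between consecutive singletons on the same chromosome.

    `singletons` holds (chromosome, position) pairs for ONE individual.
    """
    by_chromosome = defaultdict(list)
    for chromosome, position in singletons:
        by_chromosome[chromosome].append(position)
    distances = []
    for positions in by_chromosome.values():
        positions.sort()
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def fraction_closer_than(distances: List[int], threshold_bp: int = 100) -> float:
    """Share of inter-singleton distances below the threshold."""
    if not distances:
        return 0.0
    return sum(d < threshold_bp for d in distances) / len(distances)


if __name__ == "__main__":
    # Toy positions, not real data.
    toy = [("chr1", 1000), ("chr1", 1040), ("chr1", 90000),
           ("chr2", 500), ("chr2", 250500)]
    distances = inter_singleton_distances(toy)
    print(sorted(distances), fraction_closer_than(distances))
```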

To better understand the latent factors that contribute to the observed clustering, we modelled the inter-singleton distance distribution as a mixture of exponential processes (see Methods). The best-fitting version of this model consisted of four mixture components (Fig. 2).
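
To make the model concrete, a mixture of K exponential processes assigns density f(d) = Σₖ λₖ θₖ exp(−θₖ d) to an inter-singleton distance d, where λₖ is the proportion contributed by component k and θₖ is its rate (so 1/θₖ is its mean distance). The sketch below fits such a mixture with a generic expectation-maximization (EM) loop. It is not the fitting procedure used for the reported results, which is described in the Methods; the function names, the quantile-based initialization and the simulated data are assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def fit_exponential_mixture(
    distances: Sequence[float], n_components: int, n_iter: int = 100
) -> Tuple[List[float], List[float]]:
    """Fit mixture weights (lambda) and rates by EM; returns (weights, rates)."""
    data = sorted(float(d) for d in distances)
    n = len(data)
    if n < n_components:
        raise ValueError("need at least one observation per component")
    # Initialize each rate from the mean of one quantile slice of the data.
    rates = []
    for k in range(n_components):
        chunk = data[k * n // n_components:(k + 1) * n // n_components]
        rates.append(1.0 / max(sum(chunk) / len(chunk), 1e-12))
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        resp_total = [0.0] * n_components      # sum of responsibilities
        resp_distance = [0.0] * n_components   # responsibility-weighted distance
        for d in data:
            # E-step in log space for numerical stability.
            logs = [math.log(max(w, 1e-300)) + math.log(max(r, 1e-300)) - r * d
                    for w, r in zip(weights, rates)]
            top = max(logs)
            terms = [math.exp(value - top) for value in logs]
            norm = sum(terms)
            for k in range(n_components):
                gamma = terms[k] / norm
                resp_total[k] += gamma
                resp_distance[k] += gamma * d
        # M-step: closed-form updates for an exponential mixture.
        weights = [total / n for total in resp_total]
        rates = [total / dist if dist > 0 else rate
                 for total, dist, rate in zip(resp_total, resp_distance, rates)]
    return weights, rates


def simulate_distances(n: int, weights: Sequence[float],
                       means: Sequence[float], seed: int = 0) -> List[float]:
    """Draw n distances from a known exponential mixture (for testing only)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(rng.expovariate(1.0 / means[k]))
    return draws


if __name__ == "__main__":
    simulated = simulate_distances(2000, [0.3, 0.7], [5.0, 50000.0], seed=1)
    fitted_weights, fitted_rates = fit_exponential_mixture(simulated, 2)
    for w, r in zip(fitted_weights, fitted_rates):
        print(f"lambda = {w:.3f}, mean distance = {1.0 / r:.1f} bp")
```

In practice the number of components would be chosen by comparing fits with different K, for example by a penalized likelihood criterion; that step is omitted here.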

Component 1 represents singletons that occurred an average of around 2–8 bp apart and accounted for approximately 1.5% of singletons in each sample. These singletons are substantially enriched for A>T and C>A transversions (Extended Data Fig. 3a), consistent with the signatures of trans-lesion synthesis that causes multiple non-independent point mutations within very short spans. The density of component 1 singletons is also associated with CpG island density (Supplementary Fig. 21). Component 2 represents singletons occurring 500–5,000 bp apart, accounting for around 12–24% of singletons. These singletons are enriched for C>G transversions and show prominent subtelomeric concentrations on chromosomes 8p, 9p, 16p and 16q [36,37] (Extended Data Fig. 3 and Supplementary Fig. 22), consistent with the recently described maternally derived C>G mutation clusters [36,37]. The exact mechanism that underlies this distinctive clustering pattern is unknown, but may involve either hypermutability of single-stranded DNA intermediates during the repair of double-stranded breaks [36,37] or transcription-associated mutagenesis, with increased damage on the non-transcribed strand. Our results are compatible with both these mechanisms: component 2 singletons are enriched near regions of H3K4 trimethylation, a mark associated with double-stranded break response, and depleted in exon-dense regions (Supplementary Fig. 21). Component 3 singletons (occurring approximately 30–50 kilobases (kb) apart) accounted for around 43–49% of all singletons, and component 4 singletons (occurring approximately 125–170 kb apart) accounted for around 31–37% of all singletons. These latter components have nearly identical mutational spectra (Extended Data Fig. 3a) and are distributed about uniformly in the genome.

Table 1 | Number of variants in 40,722 unrelated individuals in TOPMed

| Variant class | Total (all unrelated, n = 40,722) | Singletons (%) (all unrelated) | Per-individual average | Per-individual 5th percentile | Per-individual median | Per-individual 95th percentile |
|---|---|---|---|---|---|---|
| Total variants | 384,127,954 | 203,994,740 (53) | 3,748,599 | 3,516,166 | 3,563,978 | 4,359,661 |
| SNVs | 357,043,141 | 189,429,596 (53) | 3,553,423 | 3,335,442 | 3,380,462 | 4,125,740 |
| Indels | 27,084,813 | 14,565,144 (54) | 195,176 | 180,616 | 183,503 | 233,928 |
| Novel variants | 298,373,330 | 191,557,469 (64) | 29,202 | 20,312 | 24,106 | 44,336 |
| Novel SNVs | 275,141,134 | 177,410,620 (64) | 25,027 | 17,520 | 20,975 | 36,861 |
| Novel indels | 23,232,196 | 14,146,849 (61) | 4,175 | 2,747 | 3,145 | 7,359 |
| Coding variation | 4,651,453 | 2,523,257 (54) | 23,909 | 22,158 | 22,557 | 27,716 |
| Synonymous | 1,435,058 | 715,254 (50) | 11,651 | 10,841 | 11,056 | 13,678 |
| Nonsynonymous | 2,965,093 | 1,648,672 (56) | 11,384 | 10,632 | 10,856 | 13,221 |
| Stop/essential splice | 97,217 | 60,347 (62) | 474 | 425 | 454 | 566 |
| Frameshift | 104,704 | 71,577 (68) | 132 | 112 | 127 | 165 |
| In-frame | 51,997 | 29,110 (56) | 102 | 85 | 99 | 128 |

Novel variants are taken as variants that were not present in dbSNP build 149, the most recent dbSNP version without TOPMed submissions.
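
The singleton percentages in Table 1 can be recomputed directly from the total and singleton counts in the table. The short sketch below does this for five rows; the dictionary layout and function name are chosen for this example.

```python
# (total variants, singletons) among the 40,722 unrelated individuals, from Table 1.
TABLE_1 = {
    "Total variants": (384_127_954, 203_994_740),
    "Synonymous": (1_435_058, 715_254),
    "Nonsynonymous": (2_965_093, 1_648_672),
    "Stop/essential splice": (97_217, 60_347),
    "Frameshift": (104_704, 71_577),
}


def singleton_percentage(total: int, singletons: int) -> float:
    """Percentage of variants in a class that are singletons."""
    return 100.0 * singletons / total


if __name__ == "__main__":
    for variant_class, (total, singletons) in TABLE_1.items():
        print(f"{variant_class}: {singleton_percentage(total, singletons):.1f}%")
```

The recomputed values agree with the percentages quoted in the text for unrelated individuals (53.1% overall, 49.8% synonymous, 55.6% nonsynonymous, 62.1% stop or essential splice and 68.4% frameshift).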

Beyond SNVs and indels

To evaluate the potential of our data to generate even more comprehensive variation datasets, we developed and applied a method based on de novo assembly of unmapped and mismapped read pairs, enabling us to assemble sequences that are present in a sample but absent, or improperly represented, in the reference. As the majority of non-reference human sequence is present in the assembled genomes of other primates [40,41], we leveraged available hominid references (see Methods) to specifically discover retained ancestral sequences that have been deleted in some human lineages, including on the reference haplotype.

In total, we placed 1,017 ancestral sequences, of which we were able to fully resolve 713, ranging in length from 100 bp to 39 kb (N50 = 1,183), and accounting for a total of 528,233 bp (Fig. 3a). We partially resolved 304 events, for which we assembled part of the ancestral sequence but could place only one breakpoint on the reference sequence (see Supplementary Information 1.7). Out of all 1,017 events, 551 (54.18%) occur within GENCODE v.29 genes (a proportion that is not significantly different from 54.80% of the current reference genome GRCh38 that is within genes). The assembled sequences contain repetitive motifs at a significantly higher rate than the genome as a whole (58.2% versus 50.1%) (Supplementary Tables 8–10). There is a strong overrepresentation of simple and low complexity sequences both in the reference breakpoints and within the bodies of the non-reference sequences, which could be indicative of the instability of these motifs and/or errors in the reference.
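
The N50 quoted for the resolved sequences is a standard length summary: the length L such that sequences of length L or longer together contain at least half of all assembled bases. A minimal sketch of the calculation is shown below, using toy lengths rather than the lengths of the assembled sequences.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]  # not reached for non-empty input; kept for completeness


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))        # toy lengths
    print(n50([100, 200, 300, 400, 1000]))      # toy lengths
```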

Considering only fully resolved events with genotyping rates above 95% (n = 541), we identified between 232 kb and 418 kb of retained ancestral sequence per diploid individual. Allele frequencies of assembled retained sequences are greater than those observed for SNVs and indels, with 76.7% of the assembled sequences present at allele frequency of more than 5% and only 12% of assembled sequences with allele frequency of less than 0.5% (Supplementary Fig. 23). This could reflect difficulty in assembling rare haplotypes. Consistent with observations for SNVs and indels, individuals of African ancestry had, on average, more non-reference alleles (Fig. 3b, Supplementary Fig. 24 and Supplementary Table 11). The overwhelming majority of assembled events are shared by multiple continental groups. We found 58 genic (5 of which are exonic) and 48 intergenic sequences present in a homozygous state in all individuals in the cohort, suggesting that the reference sequence may be incomplete at particular loci, directly affecting the annotation of common forms of genes, such as UBE2QL1, FOXO6 and FURIN (Supplementary Fig. 25).

Compared with two previous short-read studies on different, smaller datasets [40,41], our call set contains 356 unique sequences (251 kb). Additionally, we resolved the length and both breakpoints for 94 events (104 kb) for which only one breakpoint had been reported (Fig. 3c). Further investigation of the overlap with insertions called using long reads on 15 genomes showed that, with a single exception, all previously described events with an allele frequency of more than 12% could be confirmed (Supplementary Fig. 26).

Variation in CYP2D6

A complementary approach to de novo genome assembly is to develop approaches that combine multiple types of information (including previously observed haplotype variation, SNVs, indels, copy number and homology information) to identify and classify haplotypes in interesting regions of the genome. One such region is around the CYP2D6 gene, which encodes an enzyme that metabolizes approximately 25% of prescription drugs and the activity of which varies substantially among individuals [43–45]. More than 150 CYP2D6 haplotypes have been described, some involving a gene conversion with its nearby non-functional but highly similar paralogue CYP2D7.

We performed CYP2D6 haplotype analysis for all 53,831 TOPMed individuals [43,46]. We called a total of 99 alleles (66 known and 33 novel)

[Fig. 1 plot: number of variants (y axis; common above and rare below the x axis) against segment index (x axis, 1–2,737), shown separately for coding and noncoding segments and for high, medium and low CADD.]

Fig. 1 | Distribution of genetic variants across the genome. Common (allele frequency ≥ 0.5%) and rare (allele frequency < 0.5%) variant counts are shown above and below the x axis, respectively, within 1-Mb concatenated segments (see Methods). Segments are stratified by CADD functionality score, and sorted based on their number of rare variants according to the functionality category. There were 22 high CADD, 22 medium CADD and 34 low CADD coding segments, and 40 high CADD, 238 medium CADD and 2,381 low CADD noncoding segments. Noncoding regions of the genome with low CADD scores (<10, reflecting lower predicted function) have the largest levels of common and rare variation (noncoding plot region, dark and light blue, respectively), followed by low CADD coding regions (coding plot region, dark and light blue, respectively). Overall, the vast majority of human genomic variation comprises rare variation.

[Fig. 2 plot: component contribution (λ, y axis) against inter-singleton distance (bp, x axis) on log–log axes, for components 1–4, with points coloured by population (AFR, EAS, EUR).]

Fig. 2 | Characteristics of singleton clustering patterns. Parameter estimates for exponential mixture models of singleton density. Each point represents one of the four components in one of the 3,000 individuals in the sample, coloured according to the genetically inferred population of that individual. The rate parameters of each component are shown across the x axis, and the lambda parameters (that is, the proportion that the component contributes to the mixture) are shown on the y axis (on a log–log scale). Histograms show the distribution of the lambda and rate parameters for each component. AFR, African ancestry; EAS, East Asian ancestry; EUR, European ancestry.
