Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program

Daniel Taliun, +194 more
- 06 Mar 2019 - 
- pp 563866

UMass Chan Medical School
eScholarship@UMassChan
Open Access Publications by UMMS Authors
2021-02-10
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
Daniel Taliun
University of Michigan
Et al.
Let us know how access to this document benefits you.
Follow this and additional works at: https://escholarship.umassmed.edu/oapubs
Part of the Genomics Commons, and the Population Biology Commons
Repository Citation
Taliun D, McManus DD, Cupples LA, Laurie CC, Jaquish CE, Hernandez RD, O'Connor TD, Abecasis GR.
(2021). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Open Access
Publications by UMMS Authors. https://doi.org/10.1038/s41586-021-03205-y. Retrieved from
https://escholarship.umassmed.edu/oapubs/4616
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
This material is brought to you by eScholarship@UMassChan. It has been accepted for inclusion in Open Access
Publications by UMMS Authors by an authorized administrator of eScholarship@UMassChan. For more
information, please contact Lisa.Palmer@umassmed.edu.

290 | Nature | Vol 590 | 11 February 2021
Article
Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program
The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes) [1]. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.
Advancing DNA-sequencing technologies and decreasing costs are enabling researchers to explore human genetic variation at an unprecedented scale [2,3]. For these advances to improve our understanding of human health, they must be deployed in well-phenotyped human samples and used to build resources such as variation catalogues [3,4], control collections [5,6] and imputation reference panels [7–9]. Here we describe high-coverage whole-genome sequencing (WGS) analyses of the first 53,831 TOPMed samples (Box 1 and Extended Data Tables 1, 2); additional data are being made available as quality control, variant calling and dbGaP curation are completed (altogether more than 130,000 TOPMed samples are now available in dbGaP).
A key goal of the TOPMed programme is to understand risk factors for heart, lung, blood and sleep disorders by adding WGS and other ‘omics’ data to existing studies with deep phenotyping (Supplementary Information 1.1 and Supplementary Fig. 1). The programme currently consists of more than 80 participating studies, around 1,000 investigators and more than 30 working groups (https://www.nhlbiwgs.org/working-groups-public). TOPMed participants are ethnically and ancestrally diverse (Extended Data Fig. 1, Supplementary Information 1.1.4 and Supplementary Fig. 2). Through a combination of race and ethnicity information (from participant questionnaires and/or study inclusion criteria), we classified study participants into ‘population groups’, which varied in composition according to the goals of each analysis. In some analyses, these groups were further refined using genetic ancestry (see Methods and Supplementary Information for details).
Our study extends previous efforts by identifying and characterizing the rare variants that comprise the majority of human genomic variation [7,10–12]. Rare variants represent recent and potentially deleterious changes that can affect protein function, gene expression or other biologically important elements [11,13,14].
TOPMed WGS quality assessment
WGS of the TOPMed samples was performed over multiple studies, years and sequencing centres. To minimize batch effects, we standardized laboratory methods, mapped and processed sequence data centrally using a single pipeline, and performed variant calling and genotyping jointly across all samples (see Methods). We annotated each variant site with multiple sequence quality metrics and trained machine learning filters to identify and exclude inconsistencies that were revealed when the same individual was sequenced repeatedly. Available WGS data were processed periodically to produce genotype data ‘freezes’. The 53,831 samples described here are drawn from TOPMed freeze 5.
Stringent variant and sample quality filters were applied and the resulting genotype call sets were evaluated in several ways (Supplementary Information 1.2.2, 1.3, 1.4). First, we compared genotypes for samples sequenced in duplicate (the mean alternative allele concordance was 0.9995 for single-nucleotide variants (SNVs) and 0.9930 for insertions or deletions (indels)). Second, we compared genotypes to those from previous whole-exome sequencing datasets (protein-coding regions from GENCODE [15]; 80% of variants were found with both approaches and overlapping variant calls had a concordance of 0.9993 for SNVs and 0.9974 for indels) (Supplementary Tables 1–3). Third, we compared genotypes to those obtained using alternative informatics tools (compared to GATK v.4.1.3, TOPMed has lower Mendelian inconsistency rates and minimizes batch effects) (Supplementary Table 4). These reproducibility estimates indicate the high quality of the genotype calls and effectiveness of machine-learning-based quality filters.

https://doi.org/10.1038/s41586-021-03205-y
Received: 6 March 2019
Accepted: 7 January 2021
Published online: 10 February 2021
Open access
A list of authors and their affiliations appears at the end of the paper.

Batch effects were evaluated by (1) comparing distributions of genetic principal components among sequencing centres, which are very similar between European American and African American individuals (Supplementary Figs. 3–5); (2) comparing alternative allele concordance between duplicates among centres, which is high (the largest difference being 4 × 10⁻⁴), and the patterns of between- versus within-centre differences, which indicate random errors rather than systematic centre differences (Supplementary Figs. 6–8); and (3) performing tests of association between variants and batches, which show a very small fraction of variants with genome-wide significance (0.004%, Supplementary Figs. 9, 10) (Supplementary Information 1.2). We conclude that batch effects appear to be minor, thus enabling multi-study association testing.
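
The duplicate-based checks described above compare two genotype calls for the same individual. The Python sketch below shows one plausible way to compute an alternative allele concordance. The exact metric used for TOPMed is defined in the Supplementary Information, so the definition here (agreement among sites where both replicates are called and at least one carries an alternative allele), the function name and the toy genotypes are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def alt_allele_concordance(
    rep1: Sequence[Optional[int]],
    rep2: Sequence[Optional[int]],
) -> float:
    """Concordance of two replicate genotype vectors (alt-allele counts 0/1/2).

    Assumed definition (not taken from the paper): among sites where both
    replicates are called and at least one carries an alternative allele,
    the fraction of sites with identical genotypes.
    """
    if len(rep1) != len(rep2):
        raise ValueError("replicates must cover the same sites")
    informative = 0
    matching = 0
    for g1, g2 in zip(rep1, rep2):
        if g1 is None or g2 is None:   # skip sites with a missing call
            continue
        if g1 == 0 and g2 == 0:        # homozygous reference in both: uninformative
            continue
        informative += 1
        if g1 == g2:
            matching += 1
    if informative == 0:
        raise ValueError("no informative sites")
    return matching / informative


# Toy replicate calls (hypothetical): five informative sites, four agree.
rep_a = [0, 1, 2, 1, 0, None, 1, 0]
rep_b = [0, 1, 2, 1, 1, 2, 1, 0]
print(alt_allele_concordance(rep_a, rep_b))
```

Excluding sites that are homozygous reference in both replicates keeps the very large number of invariant positions from inflating the estimate, which is why a metric of this kind is more informative than overall genotype agreement.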
410 million genetic variants in 53,831 samples
A total of 7.0 × 10¹⁵ bases of DNA-sequencing data were generated, consisting of an average of 129.6 × 10⁹ bases of sequence distributed across 864.2 million paired reads (each 100–151 base pairs (bp) long) per individual. For a typical individual, 99.65% of the bases in the reference genome were covered, to a mean read depth of 38.2×.
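
As a rough consistency check on these throughput figures, the short Python sketch below recomputes the total yield and the implied mean read length from the per-individual averages. The genome length of 3.1 × 10⁹ bp is an approximation introduced here, not a value from the paper, so the depth it yields is only a raw upper bound on the reported mean depth.

```python
# Figures quoted in the text.
n_samples = 53_831
bases_per_sample = 129.6e9      # average sequenced bases per individual
reads_per_sample = 864.2e6      # average number of reads per individual
total_bases_reported = 7.0e15

# Assumption: approximate reference genome length, used only for a rough
# depth estimate; it is not a figure from the paper.
genome_length = 3.1e9

total_bases = n_samples * bases_per_sample
mean_read_length = bases_per_sample / reads_per_sample
raw_depth = bases_per_sample / genome_length

print(f"total bases: {total_bases:.2e}")
print(f"implied mean read length: {mean_read_length:.0f} bp")
print(f"raw depth before any filtering: {raw_depth:.1f}x")
```

The recomputed total agrees with the reported 7.0 × 10¹⁵ bases to within rounding, and the implied read length falls inside the stated 100–151 bp range. The raw depth comes out somewhat above the reported 38.2×, which is the direction one would expect if some sequenced bases do not contribute to the depth calculation.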
Sequence analysis identified 410,323,831 genetic variants (381,343,078 SNVs and 28,980,753 indels), corresponding to an average of one variant per 7 bp (Extended Data Table 4). Overall, 78.7% of these variants had not been described in dbSNP build 149; TOPMed variants now account for the majority of variants in dbSNP. Among all variant alleles, 46.0% were singletons, observed once across all 53,831 participants. Among 40,722 unrelated participants (see Methods), the proportion of singleton variants was higher at 53.1% (Table 1). Downsampling analyses show that the proportion of singletons increases until around 15,000 unrelated individuals are sequenced and then decreases very gradually (Supplementary Fig. 11). The fraction of singletons in each region or class of sites closely tracks functional constraints. For example, among all 4,651,453 protein-coding variants in unrelated individuals, the proportion of singletons was the highest for the 104,704 frameshift variants (68.4%), high among the 97,217 putative splice and truncation variants (62.1%), intermediate among the 2,965,093 nonsynonymous variants (55.6%) and lowest among the 1,435,058 synonymous variants (49.8%). Beyond protein-coding sequences, we found increased proportions of singletons in promoters (55.0%), 5′ untranslated regions (54.7%), regions of open chromatin (53.4%) and 3′ untranslated regions (53.3%); we found lower proportions of singletons in intergenic regions (53.0%) (Supplementary Table 5). Although putative transcription factor binding sites initially appeared to show fewer singletons (52.7%) than the remainder of the genome (53.1%), this pattern did not hold when we analysed highly mutable CpG sites separately. In fact, transcription factor binding sites were enriched for singletons in both CpG sites and non-CpG sites, an example of Simpson’s paradox [16].
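
The Simpson's paradox noted above can be made concrete with a small numerical example. The counts in the Python sketch below are hypothetical and chosen only to reproduce the qualitative pattern; they are not taken from the paper. The sketch assumes that binding sites contain a larger share of CpG variants and that CpG variants have a lower singleton proportion, which is enough to reverse the pooled comparison.

```python
# Hypothetical counts chosen only to illustrate the effect; they are not
# taken from the paper. Each entry is (singletons, total variants).
counts = {
    "TFBS": {"CpG": (180, 400), "non-CpG": (348, 600)},
    "rest of genome": {"CpG": (44, 100), "non-CpG": (513, 900)},
}


def proportion(singletons: int, total: int) -> float:
    """Singleton proportion within one stratum."""
    return singletons / total


def pooled(strata: dict) -> float:
    """Singleton proportion after pooling all strata of one region."""
    singletons = sum(s for s, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return singletons / total


for region, strata in counts.items():
    per_stratum = {name: round(proportion(*c), 3) for name, c in strata.items()}
    print(region, per_stratum, "pooled:", round(pooled(strata), 3))
```

With these toy numbers the binding sites have the higher singleton proportion within both the CpG and the non-CpG stratum, yet the lower proportion once the strata are pooled, because a larger share of their variants falls in the low-singleton CpG stratum.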
We identified an average of 3.78 million variants in each genome. Among these, an average of 30,207 (0.8%) were novel and 3,510 (0.1%) were singletons. Among all variants, we observed 3.17 million nonsynonymous and 1.53 million synonymous variants (a 2.1:1 ratio), but individual genomes contained similar numbers of nonsynonymous and synonymous variants (11,743 nonsynonymous and 11,768 synonymous, on average) (Extended Data Table 4). The difference can be explained if more than half of the nonsynonymous variants are removed from the population by natural selection before they become common.
Putative loss-of-function variants
A notable class of variants is the 228,966 putative loss-of-function (pLOF) variants that we observed in 18,493 (95.0%) GENCODE [15] genes (Extended Data Table 5 and Supplementary Fig. 12). This class includes the highest proportion of singletons among all of the variant classes that we examined. An average individual carried 2.5 unique pLOF variants. We identified more pLOF variants per individual than in previous surveys based on exome sequencing, an increase that was mainly driven by the identification of additional frameshift variants (Supplementary Table 6) and by a more uniform and complete coverage of protein-coding regions (Supplementary Figs. 13, 14).
We searched for gene sets with fewer rare pLOF variants than expected based on gene size. The gene sets with strong functional constraint included genes that encode DNA- and RNA-binding proteins, spliceosomal complexes, translation initiation machinery and RNA splicing and processing proteins (Supplementary Table 7). Genes associated with human disease in COSMIC [17] (31% depletion), the GWAS catalogue [18] (around 8% depletion), OMIM [19] (4% depletion) and ClinVar [20] (4% depletion) all contained fewer rare pLOF variants than expected (each comparison P < 10⁻⁴).

Box 1
TOPMed participant consents and data access
The TOPMed programme comprises more than 80 participating studies, of which 32 are represented in the 53,831 whole genomes described here. TOPMed has leveraged existing studies with deep phenotyping and longitudinal follow-up data and with varied informed consent procedures and options. Consent groups range from broad ‘general research use’ and ‘health, medical and biomedical’ categories to disease-specific categories for heart, lung, blood and/or sleep disorders. Many studies have further consent modifiers, such as limiting use to not-for-profit organizations or requiring documentation of local IRB approval. Participant consents guide the appropriate use of data by TOPMed investigators as well; therefore, the set of study-consent groups used varies across different analyses reported in this paper (Extended Data Table 3).
TOPMed data have been deposited in dbGaP and access is adjudicated by a staff committee of the National Institutes of Health. The committee verifies that applications are consistent with data use limitations and consent groups for each sample. Study investigators have no role in the decision, except in a small subset of studies that require a letter of collaboration. A summary of currently available data and any use restrictions is available at https://www.ncbi.nlm.nih.gov/gap/advanced_search/?TERM=topmed.
Although TOPMed studies have separate dbGaP accessions, formats are standardized to facilitate combining data, with all variants from the joint genotype call set included in the variant call format (VCF) files, unique sample identifiers across all of TOPMed and sample attributes with TOPMed-specific variables. Notably, cross-study analyses require the identification of a set of compatible study-consent groups. In addition to genotype calls, CRAM files with aligned sequence reads are also available, hosted in commercial clouds and with access managed by dbGaP. The dbGaP accession numbers for all TOPMed studies referenced in this paper are listed in Extended Data Tables 2, 3.
The TOPMed imputation reference panel is available to users for imputation into their own samples via an imputation server. The server performs imputation into these samples, while the reference panel data themselves are not exposed to the user because they derive from multiple studies with variable consent types and other data use limitations (Extended Data Table 3).

The distribution of genetic variation
We examined the distribution of variant sites across the genome by counting variants across ordered 1-megabase (Mb) concatenations of contiguous sequence with a similar conservation level (indicated by the combined annotation-dependent depletion (CADD) score [21]), and in segments categorized by coding versus noncoding status (Fig. 1 and Extended Data Fig. 2). As expected, the vast majority of human genomic variation is rare (minor allele frequency (MAF) < 0.5%) [10,11] and located in putatively neutral, noncoding regions of the genome (Fig. 1). Although coding regions have lower average levels of both common (MAF ≥ 0.5%) and rare variation, we identified some ultra-conserved noncoding regions with even lower levels of genetic variation [22] (Fig. 1 and Supplementary Fig. 15).
Segments with notably high or low levels of variation do exist. For example, one region on chromosome 8p (GRCh38 positions 1,000,001–7,000,000 bp) has the highest overall levels of variation (Extended Data Fig. 2). This is consistent with previous findings, as this region has been shown to have one of the highest mutation rates across the human genome [23].
Although levels of common and rare variation within segments are significantly correlated (R² = 0.462, P ≤ 2 × 10⁻¹⁶) (Supplementary Fig. 16), there are outliers. For example, segments overlapping the major histocompatibility complex (MHC) have the highest levels of common variation but no notable increase in levels of rare variation, consistent with balancing selection [24–26]. A detailed examination of the MHC shows peaks of increased variation and nucleotide diversity consistent with assembly-based analyses of the region [27] (Supplementary Fig. 17). Segments with a high proportion of coding bases feature a strong depletion in the number of common variants but only a modest depletion in rare variants (Supplementary Fig. 18).
Insights into mutation processes
A hallmark of human genetic variation is that SNVs tend to cluster together throughout the genome [3,28]. Such patterns of clustering contain important information about demographic history [29], signals of natural selection [30] and processes that generate mutations [31]. To dissect the spatial clustering of SNVs, we analysed a collection of 50,264,223 singleton SNVs ascertained in a subset of 3,000 unrelated individuals selected to have low levels of genetically estimated admixture: 1,000 each of African, East Asian and European ancestry [32] (see Methods).
In these data, we observed that 1.9% of singletons in a given individual occur less than 100 bp apart [33,34] (Supplementary Figs. 19, 20). In coalescent simulations (see Methods), only 0.16% of the simulated singletons within an individual were less than 100 bp apart (Supplementary Figs. 19, 20). Although demographic history contributes to singleton clustering (Supplementary Information 1.6), population genetic processes alone do not fully account for the observed clustering patterns, particularly for the most closely spaced singletons. To better understand the latent factors that contribute to the observed clustering, we modelled the inter-singleton distance distribution as a mixture of exponential processes (see Methods). The best-fitting version of this model consisted of four mixture components (Fig. 2).
Component 1 represents singletons that occurred an average of around 2–8 bp apart and accounted for approximately 1.5% of singletons in each sample. These singletons are substantially enriched for A>T and C>A transversions (Extended Data Fig. 3a), consistent with the signatures of trans-lesion synthesis that causes multiple non-independent point mutations within very short spans [35]. The density of component 1 singletons is also associated with CpG island density (Supplementary Fig. 21). Component 2 represents singletons occurring 500–5,000 bp apart, accounting for around 12–24% of singletons. These singletons are enriched for C>G transversions and show prominent subtelomeric concentrations on chromosomes 8p, 9p, 16p and 16q [36,37] (Extended Data Fig. 3 and Supplementary Fig. 22), consistent with the recently described maternally derived C>G mutation clusters [36,37]. The exact mechanism that underlies this distinctive clustering pattern is unknown, but may involve either hypermutability of single-stranded DNA intermediates during the repair of double-stranded breaks [36,37] or transcription-associated mutagenesis, with increased damage on the non-transcribed strand [38]. Our results are compatible with both these mechanisms: component 2 singletons are enriched near regions of H3K4 trimethylation, a mark associated with double-stranded break response [39], and depleted in exon-dense regions (Supplementary Fig. 21). Component 3 singletons (occurring approximately 30–50 kilobases (kb) apart) accounted for around 43–49% of all singletons, and component 4 singletons (occurring approximately 125–170 kb apart) accounted for around 31–37% of all singletons. These latter components have nearly identical mutational spectra (Extended Data Fig. 3a) and are distributed approximately uniformly across the genome.

Table 1 | Number of variants in 40,722 unrelated individuals in TOPMed
Columns "Total" and "Singletons (%)" refer to all unrelated individuals (n = 40,722); the remaining columns are per individual.
Category | Total | Singletons (%) | Average | 5th percentile | Median | 95th percentile
Total variants | 384,127,954 | 203,994,740 (53) | 3,748,599 | 3,516,166 | 3,563,978 | 4,359,661
Total variants: SNVs | 357,043,141 | 189,429,596 (53) | 3,553,423 | 3,335,442 | 3,380,462 | 4,125,740
Total variants: indels | 27,084,813 | 14,565,144 (54) | 195,176 | 180,616 | 183,503 | 233,928
Novel variants | 298,373,330 | 191,557,469 (64) | 29,202 | 20,312 | 24,106 | 44,336
Novel variants: SNVs | 275,141,134 | 177,410,620 (64) | 25,027 | 17,520 | 20,975 | 36,861
Novel variants: indels | 23,232,196 | 14,146,849 (61) | 4,175 | 2,747 | 3,145 | 7,359
Coding variation | 4,651,453 | 2,523,257 (54) | 23,909 | 22,158 | 22,557 | 27,716
Coding: synonymous | 1,435,058 | 715,254 (50) | 11,651 | 10,841 | 11,056 | 13,678
Coding: nonsynonymous | 2,965,093 | 1,648,672 (56) | 11,384 | 10,632 | 10,856 | 13,221
Coding: stop/essential splice | 97,217 | 60,347 (62) | 474 | 425 | 454 | 566
Coding: frameshift | 104,704 | 71,577 (68) | 132 | 112 | 127 | 165
Coding: in-frame | 51,997 | 29,110 (56) | 102 | 85 | 99 | 128
Novel variants are taken as variants that were not present in dbSNP build 149, the most recent dbSNP version without TOPMed submissions.

Beyond SNVs and indels
To evaluate the potential of our data to generate even more comprehensive variation datasets, we developed and applied a method based on de novo assembly of unmapped and mismapped read pairs, enabling us to assemble sequences that are present in a sample but absent, or improperly represented, in the reference. As the majority of non-reference human sequence is present in the assembled genomes of other primates [40,41], we leveraged available hominid references (see Methods) to specifically discover retained ancestral sequences that have been deleted in some human lineages, including on the reference haplotype.
In total, we placed 1,017 ancestral sequences, of which we were able to fully resolve 713, ranging in length from 100 bp to 39 kb (N50 = 1,183), and accounting for a total of 528,233 bp (Fig. 3a). We partially resolved 304 events, for which we assembled part of the ancestral sequence but could place only one breakpoint on the reference sequence (see Supplementary Information 1.7). Out of all 1,017 events, 551 (54.18%) occur within GENCODE v.29 [15] genes (a proportion that is not significantly different from 54.80% of the current reference genome GRCh38 that is within genes). The assembled sequences contain repetitive motifs at a significantly higher rate than the genome as a whole (58.2% versus 50.1%) (Supplementary Tables 8–10). There is a strong overrepresentation of simple and low complexity sequences both in the reference breakpoints and within the bodies of the non-reference sequences, which could be indicative of the instability of these motifs and/or errors in the reference.
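
The N50 quoted above summarizes a length distribution: it is the length such that sequences at least that long together contain at least half of all assembled bases. The Python sketch below computes it for a toy list of lengths. The function name and the example lengths are hypothetical, since the individual sequence lengths are not listed in the text.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy lengths in bp (hypothetical). The total is 4,050 bp, and the two
# longest sequences (1,800 + 1,200) already cover more than half of it.
toy_lengths = [100, 150, 300, 500, 1200, 1800]
print(n50(toy_lengths))  # → 1200
```

Unlike the mean or median length, N50 weights each sequence by the bases it contributes, so a handful of long sequences can raise it well above the typical length.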
Considering only fully resolved events with genotyping rates above 95% (n = 541), we identified between 232 kb and 418 kb of retained ancestral sequence per diploid individual. Allele frequencies of assembled retained sequences are greater than those observed for SNVs and indels, with 76.7% of the assembled sequences present at allele frequency of more than 5% and only 12% of assembled sequences with allele frequency of less than 0.5% (Supplementary Fig. 23). This could reflect difficulty in assembling rare haplotypes. Consistent with observations for SNVs and indels, individuals of African ancestry had, on average, more non-reference alleles (Fig. 3b, Supplementary Fig. 24 and Supplementary Table 11). The overwhelming majority of assembled events are shared by multiple continental groups. We found 58 genic (5 of which are exonic) and 48 intergenic sequences present in a homozygous state in all individuals in the cohort, suggesting that the reference sequence may be incomplete at particular loci, directly affecting the annotation of common forms of genes, such as UBE2QL1, FOXO6 and FURIN (Supplementary Fig. 25).
Comparing our findings to two previous short-read studies on different smaller datasets [40,41], 356 sequences (251 kb) are unique to our call set. Additionally, we resolved the length and both breakpoints for 94 events (104 kb) for which only one breakpoint had been reported (Fig. 3c). Further investigation of the overlap with insertions called using long reads on 15 genomes [42] showed that, with a single exception, all previously described events with an allele frequency of more than 12% could be confirmed (Supplementary Fig. 26).
Variation in CYP2D6
A complementary approach to de novo genome assembly is to develop approaches that combine multiple types of information (including previously observed haplotype variation, SNVs, indels, copy number and homology information) to identify and classify haplotypes in interesting regions of the genome. One such region is around the CYP2D6 gene, which encodes an enzyme that metabolizes approximately 25% of prescription drugs and the activity of which varies substantially among individuals [43–45]. More than 150 CYP2D6 haplotypes have been described, some involving a gene conversion with its nearby non-functional but highly similar paralogue CYP2D7.
We performed CYP2D6 haplotype analysis for all 53,831 TOPMed individuals [43,46]. We called a total of 99 alleles (66 known and 33 novel)
[Figure 1: bar chart. x axis: segment index (1 to 2,737), grouped into coding and noncoding segments; y axis: number of variants, with common variants above and rare variants below the axis; legend: common and rare variants in high, medium and low CADD segments.]
Fig. 1 | Distribution of genetic variants across the genome. Common (allele frequency ≥ 0.5%) and rare (allele frequency < 0.5%) variant counts are shown above and below the x axis, respectively, within 1-Mb concatenated segments (see Methods). Segments are stratified by CADD functionality score, and sorted based on their number of rare variants according to the functionality category. There were 22 high CADD, 22 medium CADD and 34 low CADD coding segments, and 40 high CADD, 238 medium CADD and 2,381 low CADD noncoding segments. Noncoding regions of the genome with low CADD scores (<10, reflecting lower predicted function) have the largest levels of common and rare variation (noncoding plot region, dark and light blue, respectively), followed by low CADD coding regions (coding plot region, dark and light blue, respectively). Overall, the vast majority of human genomic variation comprises rare variation.
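[Figure 2: scatter plot with marginal histograms. x axis: inter-singleton distance (bp), 1 to 100,000; y axis: component contribution (λ), 0.001 to 1; points coloured by population (AFR, EAS, EUR) and grouped by component (1 to 4).]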
Fig. 2 | Characteristics of singleton clustering patterns. Parameter
estimates for exponential mixture models of singleton density. Each point
represents one of the four components in one of the 3,000 individuals in the
sample, coloured according to the genetically inferred population of that
individual. The rate parameters of each component are shown across the x axis,
and the lambda parameters (that is, the proportion that the component
contributes to the mixture) are shown on the y axis (on a log–log scale).
Histograms show the distribution of the lambda and rate parameters for each
component. AFR, African ancestry; EAS, East Asian ancestry; EUR, European
ancestry.
