Home
/
Authors
/
Yujun Zhang

Author

Yujun Zhang

Other affiliations: Wellcome Trust, Wellcome Trust Sanger Institute

Bio: Yujun Zhang is an academic researcher from Peking Union Medical College. The author has contributed to research in topics: Genome & Genomics. The author has an hindex of 12, co-authored 17 publications receiving 15093 citations. Previous affiliations of Yujun Zhang include Wellcome Trust & Wellcome Trust Sanger Institute.

Topics: Genome, Genomics, Structural variation, Population, Copy-number variation ...read more

Papers

PDF

Open Access

More filters

Journal Article•DOI•

A global reference for human genetic variation.

[...]

Adam Auton¹, Gonçalo R. Abecasis², David Altshuler³, Richard Durbin⁴ +514 more•Institutions (90)

01 Oct 2015-Nature

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-generation sequencing, deep exome sequencing, and dense microarray genotyping.

...read moreread less

Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

...read moreread less

12,661 citations

A global reference for human genetic variation

[...]

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin +476 more

01 Oct 2015

TL;DR: The 1000 Genomes Project as mentioned in this paper provided a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and reported the completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole genome sequencing, deep exome sequencing and dense microarray genotyping.

...read moreread less

3,247 citations

Journal Article•DOI•

Origins and functional impact of copy number variation in the human genome

[...]

Donald F. Conrad¹, Dalila Pinto², Richard Redon³, Richard Redon¹, Lars Feuk⁴, Lars Feuk², Omer Gokcumen⁵, Yujun Zhang¹, Jan Aerts¹, T. Daniel Andrews¹, Chris P. Barnes¹, Peter J. Campbell¹, Tomas W Fitzgerald¹, Min Hu¹, Chun Hwa Ihm⁵, K. Kristiansson¹, Daniel G. MacArthur¹, Jeffrey R. MacDonald², Ifejinelo Onyiah¹, Andy Wing Chun Pang², Samuel Robson¹, Kathy Stirrups¹, Armand Valsesia¹, Klaudia Walter¹, John Wei², Chris Tyler-Smith¹, Nigel P. Carter¹, Charles Lee⁵, Stephen W. Scherer⁶, Stephen W. Scherer², Matthew E. Hurles¹ - Show less +27 more•Institutions (6)

Wellcome Trust Sanger Institute¹, The Centre for Applied Genomics², French Institute of Health and Medical Research³, Uppsala University⁴, Brigham and Women's Hospital⁵, University of Toronto⁶

01 Apr 2010-Nature

TL;DR: It is concluded that the heritability void left by genome-wide association studies will not be accounted for by common CNVs, and 30 loci with CNVs that are candidates for influencing disease susceptibility are identified.

...read moreread less

Abstract: Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.

...read moreread less

1,892 citations

Journal Article•DOI•

Mapping copy number variation by population-scale genome sequencing

[...]

Ryan E. Mills¹, Klaudia Walter², Chip Stewart³, Robert E. Handsaker⁴ +371 more•Institutions (21)

03 Feb 2011-Nature

TL;DR: A map of unbalanced SVs is constructed based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations, and serves as a resource for sequencing-based association studies.

...read moreread less

Abstract: Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.

...read moreread less

1,085 citations

A map of human genome variation from population-scale sequencing

[...]

Richard Durbin, David Altshuler, Gonçalo R. Abecasis, David R. Bentley +358 more

01 Oct 2010

TL;DR: The pilot phase of the 1000 Genomes Project is presented, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms, and the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants are described.

...read moreread less

599 citations

1
2
3
4
…

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

A framework for variation discovery and genotyping using next-generation DNA sequencing data

[...]

Mark A. DePristo¹, Eric Banks¹, Ryan Poplin¹, Kiran V. Garimella¹, Jared Maguire¹, Christopher Hartl¹, Anthony A. Philippakis¹, Anthony A. Philippakis², Anthony A. Philippakis³, Guillermo del Angel¹, Manuel A. Rivas¹, Manuel A. Rivas², Matt Hanna¹, Aaron McKenna¹, Timothy Fennell¹, Andrew Kernytsky¹, Andrey Sivachenko¹, Kristian Cibulskis¹, Stacey Gabriel¹, David Altshuler¹, David Altshuler², Mark J. Daly², Mark J. Daly¹ - Show less +19 more•Institutions (3)

Broad Institute¹, Harvard University², Brigham and Women's Hospital³

01 May 2011-Nature Genetics

TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.

...read moreread less

Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.

...read moreread less

10,056 citations

Journal Article•DOI•

Analysis of protein-coding genetic variation in 60,706 humans

[...]

Monkol Lek, Konrad J. Karczewski¹, Konrad J. Karczewski², Eric Vallabh Minikel², Eric Vallabh Minikel¹, Kaitlin E. Samocha, Eric Banks¹, Timothy Fennell¹, Anne H. O’Donnell-Luria³, Anne H. O’Donnell-Luria¹, Anne H. O’Donnell-Luria², James S. Ware, Andrew J. Hill¹, Andrew J. Hill², Andrew J. Hill⁴, Beryl B. Cummings¹, Beryl B. Cummings², Taru Tukiainen², Taru Tukiainen¹, Daniel P. Birnbaum¹, Jack A. Kosmicki, Laramie E. Duncan¹, Laramie E. Duncan², Karol Estrada², Karol Estrada¹, Fengmei Zhao², Fengmei Zhao¹, James Zou¹, Emma Pierce-Hoffman¹, Emma Pierce-Hoffman², Joanne Berghout⁵, David Neil Cooper⁶, Nicole A. Deflaux⁷, Mark A. DePristo¹, Ron Do, Jason Flannick², Jason Flannick¹, Menachem Fromer, Laura D. Gauthier¹, Jackie Goldstein¹, Jackie Goldstein², Namrata Gupta¹, Daniel P. Howrigan¹, Daniel P. Howrigan², Adam Kiezun¹, Mitja I. Kurki², Mitja I. Kurki¹, Ami Levy Moonshine¹, Pradeep Natarajan, Lorena Orozco, Gina M. Peloso², Gina M. Peloso¹, Ryan Poplin¹, Manuel A. Rivas¹, Valentin Ruano-Rubio¹, Samuel A. Rose¹, Douglas M. Ruderfer⁸, Khalid Shakir¹, Peter D. Stenson⁶, Christine Stevens¹, Brett Thomas¹, Brett Thomas², Grace Tiao¹, María Teresa Tusié-Luna, Ben Weisburd¹, Hong-Hee Won⁹, Dongmei Yu, David Altshuler¹⁰, David Altshuler¹, Diego Ardissino, Michael Boehnke¹¹, John Danesh¹², Stacey Donnelly¹, Roberto Elosua, Jose C. Florez¹, Jose C. Florez², Stacey Gabriel¹, Gad Getz², Gad Getz¹, Stephen J. Glatt¹³, Christina M. Hultman¹⁴, Sekar Kathiresan, Markku Laakso¹⁵, Steven A. McCarroll¹, Steven A. McCarroll², Mark I. McCarthy¹⁶, Mark I. McCarthy¹⁷, Dermot P.B. McGovern¹⁸, Ruth McPherson¹⁹, Benjamin M. Neale², Benjamin M. Neale¹, Aarno Palotie, Shaun Purcell⁸, Danish Saleheen²⁰, Jeremiah M. Scharf, Pamela Sklar, Patrick F. Sullivan²¹, Patrick F. Sullivan¹⁴, Jaakko Tuomilehto²², Ming T. Tsuang²³, Hugh Watkins¹⁷, Hugh Watkins¹⁶, James G. Wilson²⁴, Mark J. Daly¹, Mark J. Daly², Daniel G. MacArthur², Daniel G. MacArthur¹ - Show less +103 more•Institutions (24)

Broad Institute¹, Harvard University², Boston Children's Hospital³, University of Washington⁴, University of Arizona⁵, Cardiff University⁶, Google⁷, Icahn School of Medicine at Mount Sinai⁸, Samsung Medical Center⁹, Vertex Pharmaceuticals¹⁰, University of Michigan¹¹, University of Cambridge¹², State University of New York Upstate Medical University¹³, Karolinska Institutet¹⁴, University of Eastern Finland¹⁵, University of Oxford¹⁶, Wellcome Trust Centre for Human Genetics¹⁷, Cedars-Sinai Medical Center¹⁸, University of Ottawa¹⁹, University of Pennsylvania²⁰, University of North Carolina at Chapel Hill²¹, University of Helsinki²², University of California, San Diego²³, University of Mississippi Medical Center²⁴

18 Aug 2016-Nature

TL;DR: The aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC) provides direct evidence for the presence of widespread mutational recurrence.

...read moreread less

Abstract: Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.

...read moreread less

8,758 citations

Journal Article•DOI•

An integrated map of genetic variation from 1,092 human genomes

[...]

Gonçalo R. Abecasis¹, Adam Auton², Lisa D. Brooks³, Mark A. DePristo⁴, Richard Durbin⁵, Robert E. Handsaker⁴, Robert E. Handsaker⁶, Hyun Min Kang¹, Gabor T. Marth⁷, Gil McVean⁸ - Show less +6 more•Institutions (8)

University of Michigan¹, Yeshiva University², National Institutes of Health³, Broad Institute⁴, Wellcome Trust Sanger Institute⁵, Harvard University⁶, Boston College⁷, University of Oxford⁸

01 Nov 2012-Nature

TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.

...read moreread less

Abstract: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

...read moreread less

7,710 citations

Journal Article•DOI•

A Map of Human Genome Variation From Population-Scale Sequencing

[...]

Gonçalo R. Abecasis¹, David Altshuler², David Altshuler³, Adam Auton⁴, Lisa D Brooks⁵, Richard Durbin⁶, Richard A. Gibbs⁷, Matthew E. Hurles⁶, Gil McVean⁴ - Show less +5 more•Institutions (7)

University of Michigan¹, Broad Institute², Harvard University³, University of Oxford⁴, Johns Hopkins University⁵, Wellcome Trust Sanger Institute⁶, Baylor College of Medicine⁷

28 Oct 2010-Nature

TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype as mentioned in this paper, and the results of the pilot phase of the project, designed to develop and compare different strategies for genomewide sequencing with high-throughput platforms.

...read moreread less

Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

...read moreread less

7,538 citations

Journal Article•DOI•

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

[...]

Heng Li¹•Institutions (1)

Broad Institute¹

01 Nov 2011-Bioinformatics

TL;DR: This work presents a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation and demonstrates that this method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping.

...read moreread less

Abstract: Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty. Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors. Availability: http://samtools.sourceforge.net Contact: hengli@broadinstitute.org

...read moreread less

4,949 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse