Showing papers by "Yingrui Li published in 2010"
••
TL;DR: The Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals are described, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species.
Abstract: To understand the impact of gut microbes on human health and well-being it is crucial to assess their genetic potential. Here we describe the Illumina-based metagenomic sequencing, assembly and characterization of 3.3 million non-redundant microbial genes, derived from 576.7 gigabases of sequence, from faecal samples of 124 European individuals. The gene set, ~150 times larger than the human gene complement, contains an overwhelming majority of the prevalent (more frequent) microbial genes of the cohort and probably includes a large proportion of the prevalent human intestinal microbial genes. The genes are largely shared among individuals of the cohort. Over 99% of the genes are bacterial, indicating that the entire cohort harbours between 1,000 and 1,150 prevalent bacterial species and each individual at least 160 such species, which are also largely shared. We define and describe the minimal gut metagenome and the minimal gut bacterial genome in terms of functions present in all individuals and most bacteria, respectively.
9,268 citations
••
TL;DR: The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Abstract: Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
2,760 citations
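The N50 figures quoted in the abstract above summarize assembly contiguity. As a minimal sketch, assuming the common definition (the length of the contig or scaffold at which the cumulative length, counted from the longest sequence down, first reaches half of the total assembly length), N50 could be computed as follows. The function name `n50` and the toy lengths are illustrative only and are not taken from the paper or from any assembler's code.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    Assumed definition: sort lengths from longest to shortest and return the
    length at which the running total first reaches half of the overall sum.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy example (hypothetical lengths, in kb): total is 20, and 6 + 5 = 11
# is the first running total that reaches half of it.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness; it is usually reported alongside other quality measures.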
••
TL;DR: A population genomic survey has revealed a functionally important locus in genetic adaptation to high altitude, and the strongest signal of natural selection came from endothelial Per-Arnt-Sim domain protein 1 (EPAS1), a transcription factor involved in response to hypoxia.
Abstract: Residents of the Tibetan Plateau show heritable adaptations to extreme altitude. We sequenced 50 exomes of ethnic Tibetans, encompassing coding sequences of 92% of human genes, with an average coverage of 18x per individual. Genes showing population-specific allele frequency changes, which represent strong candidates for altitude adaptation, were identified. The strongest signal of natural selection came from endothelial Per-Arnt-Sim (PAS) domain protein 1 (EPAS1), a transcription factor involved in response to hypoxia. One single-nucleotide polymorphism (SNP) at EPAS1 shows a 78% frequency difference between Tibetan and Han samples, representing the fastest allele frequency change observed at any human gene to date. This SNP's association with erythrocyte abundance supports the role of EPAS1 in adaptation to hypoxia. Thus, a population genomic survey has revealed a functionally important locus in genetic adaptation to high altitude.
1,325 citations
••
Chinese Academy of Sciences1, Shanghai Jiao Tong University2, Fudan University3, Kunming Institute of Zoology4, Shenzhen University5, Chengdu Research Base of Giant Panda Breeding6, Wellcome Trust7, University of Toronto8, University of California, Berkeley9, Southeast University10, University of Hong Kong11, Sun Yat-sen University12, University of Vienna13, Cardiff University14, Comenius University in Bratislava15, Sichuan University16, South China University of Technology17, University of Copenhagen18, University of Alberta19, University of Washington20
TL;DR: Using next-generation sequencing technology alone, a draft sequence of the giant panda genome is generated and assembled, indicating that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition.
Abstract: Using next-generation sequencing technology alone, we have successfully generated and assembled a draft sequence of the giant panda genome. The assembled contigs (2.25 gigabases (Gb)) cover approximately 94% of the whole genome, and the remaining gaps (0.05 Gb) seem to contain carnivore-specific repeats and tandem repeats. Comparisons with the dog and human showed that the panda genome has a lower divergence rate. The assessment of panda genes potentially underlying some of its unique traits indicated that its bamboo diet might be more dependent on its gut microbiome than its own genetic composition. We also identified more than 2.7 million heterozygous single nucleotide polymorphisms in the diploid genome. Our data and analyses provide a foundation for promoting mammalian genetic research, and demonstrate the feasibility for using next-generation sequencing technologies for accurate, cost-effective and rapid de novo assembly of large eukaryotic genomes.
1,109 citations
••
University of Copenhagen1, Estonian Biocentre2, University of Cambridge3, Technical University of Denmark4, University of California, Berkeley5, University of Oxford6, University of Bradford7, Murdoch University8, Australian Federal Police9, École normale supérieure de Lyon10, Aarhus University11, Russian Academy12, Russian Academy of Sciences13, University of Kansas14
TL;DR: This genome sequence of an ancient human obtained from ∼4,000-year-old permafrost-preserved hair provides evidence for a migration from Siberia into the New World some 5,500 years ago, independent of that giving rise to the modern Native Americans and Inuit.
Abstract: We report here the genome sequence of an ancient human. Obtained from approximately 4,000-year-old permafrost-preserved hair, the genome represents a male individual from the first known culture to settle in Greenland. Sequenced to an average depth of 20x, we recover 79% of the diploid genome, an amount close to the practical limit of current sequencing technologies. We identify 353,151 high-confidence single-nucleotide polymorphisms (SNPs), of which 6.8% have not been reported previously. We estimate raw read contamination to be no higher than 0.8%. We use functional SNP assessment to assign possible phenotypic characteristics of the individual that belonged to a culture whose location has yielded only trace human remains. We compare the high-confidence SNPs to those of contemporary populations to find the populations most closely related to the individual. This provides evidence for a migration from Siberia into the New World some 5,500 years ago, independent of that giving rise to the modern Native Americans and Inuit.
749 citations
01 Oct 2010
TL;DR: The pilot phase of the 1000 Genomes Project is presented, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms, and the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants are described.
599 citations
••
TL;DR: Analysis across the genome of patterns of DNA methylation reveals a rich landscape of allele-specific epigenetic modification and consequent effects on allele-specific gene expression.
Abstract: DNA methylation plays an important role in biological processes in human health and disease. Recent technological advances allow unbiased whole-genome DNA methylation (methylome) analysis to be carried out on human cells. Using whole-genome bisulfite sequencing at 24.7-fold coverage (12.3-fold per strand), we report a comprehensive (92.62%) methylome and analysis of the unique sequences in human peripheral blood mononuclear cells (PBMC) from the same Asian individual whose genome was deciphered in the YH project. PBMC constitute an important source for clinical blood tests world-wide. We found that 68.4% of CpG sites and [...] 80% displayed allele-specific expression (ASE). These data demonstrate that ASM is a recurrent phenomenon and is highly correlated with ASE in human PBMCs. Together with recently reported similar studies, our study provides a comprehensive resource for future epigenomic research and confirms new sequencing technology as a paradigm for large-scale epigenomics studies.
336 citations
••
TL;DR: The methylome of a model insect, the silkworm Bombyx mori, is surveyed at single-base resolution using Illumina high-throughput bisulfite sequencing (MethylC-Seq), finding that transposable elements, promoters and ribosomal DNAs are hypomethylated, but in contrast, genomic loci matching small RNAs in gene bodies are densely methylated.
Abstract: Epigenetic regulation in insects may have effects on diverse biological processes. Here we survey the methylome of a model insect, the silkworm Bombyx mori, at single-base resolution using Illumina high-throughput bisulfite sequencing (MethylC-Seq). We conservatively estimate that 0.11% of genomic cytosines are methylcytosines, all of which probably occur in CG dinucleotides. CG methylation is substantially enriched in gene bodies and is positively correlated with gene expression levels, suggesting it has a positive role in gene transcription. We find that transposable elements, promoters and ribosomal DNAs are hypomethylated, but in contrast, genomic loci matching small RNAs in gene bodies are densely methylated. This work contributes to our understanding of epigenetics in insects, and in contrast to previous studies of the highly methylated genomes of Arabidopsis and human, demonstrates a strategy for sequencing the epigenomes of organisms such as insects that have low levels of methylation.
321 citations
••
TL;DR: Exome sequencing of 200 individuals from Denmark with targeted capture of 18,654 coding genes and sequence coverage of each individual exome at an average depth of 12-fold is reported, suggesting that deleterious substitutions are primarily recessive.
Abstract: Targeted capture combined with massively parallel exome sequencing is a promising approach to identify genetic variants implicated in human traits. We report exome sequencing of 200 individuals from Denmark with targeted capture of 18,654 coding genes and sequence coverage of each individual exome at an average depth of 12-fold. On average, about 95% of the target regions were covered by at least one read. We identified 121,870 SNPs in the sample population, including 53,081 coding SNPs (cSNPs). Using a statistical method for SNP calling and an estimation of allelic frequencies based on our population data, we derived the allele frequency spectrum of cSNPs with a minor allele frequency greater than 0.02. We identified a 1.8-fold excess of deleterious, non-synonymous cSNPs over synonymous cSNPs in the low-frequency range (minor allele frequencies between 2% and 5%). This excess was more pronounced for X-linked SNPs, suggesting that deleterious substitutions are primarily recessive.
319 citations
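The minor allele frequency (MAF) thresholds mentioned in the abstract above can be illustrated with a short sketch. It assumes simple allele counting at an autosomal site (200 diploid individuals give 400 allele copies) and treats the 2% to 5% range as inclusive at both ends; the paper's own statistical SNP-calling and frequency-estimation method is not reproduced here, and the function names are hypothetical.

```python
def minor_allele_frequency(alt_count, total_alleles):
    """Frequency of the rarer allele at a biallelic site.

    alt_count: number of observed copies of the non-reference allele.
    total_alleles: number of allele copies sampled (2 per diploid individual
    at an autosomal site). Plain counting is an assumption of this sketch.
    """
    if total_alleles <= 0 or not 0 <= alt_count <= total_alleles:
        raise ValueError("counts out of range")
    freq = alt_count / total_alleles
    return min(freq, 1 - freq)


def in_low_frequency_range(maf, low=0.02, high=0.05):
    """True if maf lies in the low-frequency range; inclusive bounds are assumed."""
    return low <= maf <= high


# Hypothetical site: 12 non-reference copies among 400 sampled alleles.
maf = minor_allele_frequency(12, 400)
print(maf, in_low_frequency_range(maf))
```

Taking the minimum of the allele frequency and its complement means the result does not depend on which allele happens to be the reference.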
••
TL;DR: The finding of TGM6 as a novel causative gene of spinocerebellar ataxia illustrates whole-exome sequencing of affected individuals from one family as an effective and cost efficient method for mapping genes of rare Mendelian disorders and the use of linkage analysis and exome sequencing for further improving efficiency.
Abstract: Autosomal-dominant spinocerebellar ataxias constitute a large, heterogeneous group of progressive neurodegenerative diseases with multiple types. To date, classical genetic studies have revealed 31 distinct genetic forms of spinocerebellar ataxias and identified 19 causative genes. Traditional positional cloning strategies, however, have limitations for finding causative genes of rare Mendelian disorders. Here, we used a combined strategy of exome sequencing and linkage analysis to identify a novel spinocerebellar ataxia causative gene, TGM6. We sequenced the whole exome of four patients in a Chinese four-generation spinocerebellar ataxia family and identified a missense mutation, c.1550T-G transition (L517W), in exon 10 of TGM6. This change is at a highly conserved position, is predicted to have a functional impact, and completely cosegregated with the phenotype. The exome results were validated using linkage analysis. The mutation we identified using exome sequencing was located in the same region (20p13-12.2) as that identified by linkage analysis, which cross-validated TGM6 as the causative spinocerebellar ataxia gene in this family. We also showed that the causative gene could be mapped by a combined method of linkage analysis and sequencing of one sample from the family. We further confirmed our finding by identifying another missense mutation c.980A-G transition (D327G) in exon seven of TGM6 in an additional spinocerebellar ataxia family, which also cosegregated with the phenotype. Both mutations were absent in 500 normal unaffected individuals of matched geographical ancestry. The finding of TGM6 as a novel causative gene of spinocerebellar ataxia illustrates whole-exome sequencing of affected individuals from one family as an effective and cost efficient method for mapping genes of rare Mendelian disorders and the use of linkage analysis and exome sequencing for further improving efficiency.
283 citations
••
TL;DR: It is estimated that a complete human pan-genome would contain ∼19–40 Mb of novel sequence not present in the extant reference genome, indicating the importance of using complete genome sequencing and de novo assembly.
Abstract: Here we integrate the de novo assembly of an Asian and an African genome with the NCBI reference human genome, as a step toward constructing the human pan-genome. We identified ∼5 Mb of novel sequences not present in the reference genome in each of these assemblies. Most novel sequences are individual or population specific, as revealed by their comparison to all available human DNA sequence and by PCR validation using the human genome diversity cell line panel. We found novel sequences present in patterns consistent with known human migration paths. Cross-species conservation analysis of predicted genes indicated that the novel sequences contain potentially functional coding regions. We estimate that a complete human pan-genome would contain ∼19–40 Mb of novel sequence not present in the extant reference genome. The extensive amount of novel sequence contributing to the genetic variation of the pan-genome indicates the importance of using complete genome sequencing and de novo assembly.
••
TL;DR: It is found that 3 gigabases (Gbp) of uniquely mapped 45 bp paired-end MeDIP-seq or MBD-seq reads is the minimum requirement and a cost-effective strategy for methylome pattern analysis.
••
TL;DR: These results provide guidelines for researchers who are developing association mapping studies based on next‐generation sequencing and suggest that with a fixed cost, sequencing many individuals at a more shallow depth with larger pool size achieves higher power than sequencing a small number of individuals in higher depth with smaller pool size, even in the presence of high error rates.
Abstract: Most common hereditary diseases in humans are complex and multifactorial. Large-scale genome-wide association studies based on SNP genotyping have only identified a small fraction of the heritable variation of these diseases. One explanation may be that many rare variants (a minor allele frequency, MAF <5%), which are not included in the common genotyping platforms, may contribute substantially to the genetic variation of these diseases. Next-generation sequencing, which would allow the analysis of rare variants, is now becoming so cheap that it provides a viable alternative to SNP genotyping. In this paper, we present cost-effective protocols for using next-generation sequencing in association mapping studies based on pooled and un-pooled samples, and identify optimal designs with respect to total number of individuals, number of individuals per pool, and the sequencing coverage. We perform a small empirical study to evaluate the pooling variance in a realistic setting where pooling is combined with exon-capturing. To test for associations, we develop a likelihood ratio statistic that accounts for the high error rate of next-generation sequencing data. We also perform extensive simulations to determine the power and accuracy of this method. Overall, our findings suggest that with a fixed cost, sequencing many individuals at a more shallow depth with larger pool size achieves higher power than sequencing a small number of individuals in higher depth with smaller pool size, even in the presence of high error rates. Our results provide guidelines for researchers who are developing association mapping studies based on next-generation sequencing.
••
TL;DR: Two public short-read de novo assembly applications that can handle human genomes, ABySS and SOAPdenovo, are described.
Abstract: Recent studies in human genomes have demonstrated the use of de novo assemblies to identify genetic variations that are difficult for mapping-based approaches. Construction of multiple human genome assemblies is enabled by massively parallel sequencing, but a conventional bioinformatics solution is costly and slow, creating bottle-necks in the process. This review describes two public short-read de novo assembly applications that can handle human genomes, ABySS and SOAPdenovo. It also discusses the technical aspects and future challenges of human genome de novo assembly by short reads.
15 Dec 2010
TL;DR: In this paper, a method and a system for detecting a polymorphism locus of a genome target region is described, which comprises the steps of obtaining an exon sequencing result, removing redundancy and sequencing, carrying out statistical analysis I, detecting an SNP (Single Nucleotide Polymorphism) locus, filtering the SNP locus and annotating the SNP.
Abstract: The invention discloses a method and a system for detecting a polymorphism locus of a genome target region. The method comprises the steps of: obtaining an exon sequencing result, removing redundancy and sequencing, carrying out statistical analysis I, detecting an SNP (Single Nucleotide Polymorphism) locus, filtering the SNP locus, carrying out statistical analysis II and annotating the SNP. The SNP analysis can be carried out by sequencing a genome specific region; and the invention has the advantages of high detection accuracy of SNP result, high speed and low cost, and can realize the automation in the whole process, i.e. the high-quality SNP locus is automatically generated by using original sequencing data as a data source, and the SNP locus can be annotated and classified.
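Three of the steps listed in the abstract above (removing redundancy, detecting SNP loci, filtering SNP loci) can be sketched in a toy form. This is not the patented method: the read representation (a start position plus a gap-free sequence already aligned to the reference), the depth and allele-fraction thresholds, and all function names are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def remove_duplicates(reads):
    """Drop reads with an identical (start, sequence) pair, keeping the first."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


def pileup(reads):
    """Count the bases observed at each reference position."""
    counts = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            counts[start + offset][base] += 1
    return counts


def call_snps(reads, reference, min_depth=3, min_alt_fraction=0.3):
    """Return (position, ref_base, alt_base, depth) for sites passing the filters.

    min_depth and min_alt_fraction are hypothetical thresholds for this sketch.
    """
    snps = []
    for pos, bases in sorted(pileup(remove_duplicates(reads)).items()):
        ref_base = reference[pos]
        alt_counts = {b: n for b, n in bases.items() if b != ref_base}
        if not alt_counts:
            continue  # every read agrees with the reference here
        alt_base, alt_n = max(alt_counts.items(), key=lambda item: item[1])
        depth = sum(bases.values())
        if depth >= min_depth and alt_n / depth >= min_alt_fraction:
            snps.append((pos, ref_base, alt_base, depth))
    return snps


# Toy data: the sample carries an A instead of T at position 3;
# the first read appears twice to exercise duplicate removal.
reference = "ACGTACGT"
reads = [(0, "ACGA"), (0, "ACGA"), (1, "CGAA"), (2, "GAAC"), (3, "AACG")]
print(call_snps(reads, reference))  # [(3, 'T', 'A', 4)]
```

Real pipelines work on aligned read files with base qualities and use statistical genotype models rather than fixed count thresholds; the sketch only shows how the steps chain together.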
••
University of Copenhagen1, Beijing Institute of Genomics2, Chinese Academy of Sciences3, Shanghai Jiao Tong University4, Fudan University5, Kunming Institute of Zoology6, Shenzhen University7, Chengdu Research Base of Giant Panda Breeding8, Wellcome Trust9, University of Toronto10, University of California, Berkeley11, Southeast University12, University of Hong Kong13, Sun Yat-sen University14, University of Veterinary Medicine Vienna15, Cardiff University16, Comenius University in Bratislava17, Sichuan University18, South China University of Technology19, University of Alberta20, University of Washington21
TL;DR: This corrects the Article in Nature 463, 311–317 (2010), in which the Latin species name of the giant panda was written incorrectly as Ailuropoda melanoleura; the correct name is Ailuropoda melanoleuca.
Abstract: Nature 463, 311–317 (2010) In this Article, the Latin species name of the giant panda was written incorrectly as Ailuropoda melanoleura. The correct name is Ailuropoda melanoleuca.
01 Jan 2010
TL;DR: Here, the de novo assembly of an Asian and an African genome with the NCBI reference human genome is integrated, as a step toward constructing the human pan-genome.
Abstract: Here we integrate the de novo assembly of an Asian and an African genome with the NCBI reference human genome, as a step toward constructing the human pan-genome. We identified ~5 Mb of novel sequences not present in the reference genome in each of these assemblies. Most novel sequences are individual or population specific, as revealed by their comparison to all available human DNA sequence and by PCR validation using the human genome diversity cell line panel. We found novel sequences present in patterns consistent with known human migration paths. Cross-species conservation analysis of predicted genes indicated that the novel sequences contain potentially functional coding regions. We estimate that a complete human pan-genome would contain ~19–40 Mb of novel sequence not present in the extant reference genome. The extensive amount of novel sequence contributing to the genetic variation of the pan-genome indicates the importance of using complete genome sequencing and de novo assembly. The Human Genome Project 1 established the foundation for human genomics studies. Subsequent analyses unveiled genetic variations and identified their effects on phenotypic diversity and differences in disease susceptibility 2 . Guided by the National Center for Biotechnology Information (NCBI) reference genome, initial studies of human genetic variation focused largely on identifying 3 and cataloging 4,5 single-nucleotide polymorphisms (SNPs) and studying their association to human diseases 6 . Structural variation (which is thought to contribute more variant sequences than SNPs) has also been extensively identified and analyzed in the human genome 7–10 . The availability of a number of individual human genomes 11–15 has provided an unprecedented opportunity to investigate detailed genetic differences at the individual level. 
Preliminary analyses have revealed that these genomes contain sequences that could not be mapped onto the human reference genome (novel sequences), resulting in the proposal that the majority of these sequences likely belong to the gap regions in the current version of the human genome assembly 12 . When fosmid clones from HapMap samples were sequenced, 525 sequences were identified that mapped instead to highly poly
••
TL;DR: The understanding that the majority of the current population of the Tibetan plateau may trace their genetic ancestry back to quite recent immigrants into Tibet, even though humans have lived in Tibet for a much longer time—possibly with some continuity of culture—is important for understanding the difference between inferences based on archaeology and inferences based on genetics.
Abstract: We thank Brantingham et al. for their interest in our study; we agree that both molecular and archaeological evidence should be used to understand the demographic history of the Tibetan people. Our Report focused not on the demographic history of the Tibetan population, but rather the selection acting on specific putatively adaptive mutations segregating in the Tibetan population. We included some limited demographic analyses because they helped illuminate our results regarding natural selection. The real demographic model is clearly likely to be more complex than the simple models of two populations diverging from each other. For example, Zhao et al. (1) used mitochondrial DNA to argue that late settlers of the Tibetan plateau may not have entirely replaced the original population but that a small proportion of them carry mitochondrial DNA lineages tracing back to Late Paleolithic inhabitants on the plateau. If this is the case, even if the EPAS1 variant was present in the early inhabitants of Tibet, strong selection would be needed to increase its frequency in the modern Tibetan gene pool. The understanding that the majority of the current population of the Tibetan plateau may trace their genetic ancestry back to quite recent immigrants into Tibet, even though humans have lived in Tibet for a much longer time—possibly with some continuity of culture—is important for understanding the difference between inferences based on archaeology and inferences based on genetics.
1. M. Zhao et al., Proc. Natl. Acad. Sci. U.S.A. 106, 21230 (2009).