Showing papers by "Richard K. Wilson" published in 2009


Journal ArticleDOI
Patrick S. Schnable, Doreen Ware, Robert S. Fulton, Joshua C. Stein, +156 more authors; Institutions (18)
20 Nov 2009-Science
TL;DR: An improved draft sequence of the 2.3-gigabase maize genome is reported, together with the correlation of methylation-poor regions with Mu transposon insertions and recombination, and an account of how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state.
Abstract: We report an improved draft nucleotide sequence of the 2.3-gigabase genome of maize, an important crop plant and model for biological research. Over 32,000 genes were predicted, of which 99.8% were placed on reference chromosomes. Nearly 85% of the genome is composed of hundreds of families of transposable elements, dispersed nonuniformly across the genome. These were responsible for the capture and amplification of numerous gene fragments and affect the composition, sizes, and positions of centromeres. We also report on the correlation of methylation-poor regions with Mu transposon insertions and recombination, and copy number variants with insertions and/or deletions, as well as how uneven gene losses between duplicated regions were involved in returning an ancient allotetraploid to a genetically diploid state. These analyses inform and set the stage for further investigations to improve our understanding of the domestication and agricultural improvements of maize.

3,761 citations


Journal ArticleDOI
TL;DR: By comparing the sequences of tumor and skin genomes of a patient with AML-M1, recurring mutations that may be relevant for pathogenesis are identified.
Abstract: From the Departments of Genetics (E.R.M., L.D., V.J.M., R.K.W., T.J.L.), Medicine (R.E.R., P.W., M.H.T., S.H., W.D.S., D.C.L., M.J.W., T.A.G., J.F.D., T.J.L.), and Pathology and Immunology (J.E.P., M.A.W., R.N.); the Genome Center (E.R.M., L.D., D.J.D., D.E.L., M.D.M., K.C., D.C.K., R.S.F., K.D.D., S.D.M., L.A.F., D.P.L., V.J.M., R.M.A.,

2,151 citations


Journal ArticleDOI
TL;DR: The BreakDancer algorithm predicts a wide variety of structural variants, including insertion-deletions (indels), inversions and translocations, and sensitively and accurately detects indels ranging from 10 base pairs to 1 megabase pair that are difficult to detect via a single conventional approach.
Abstract: This software package provides genome-wide detection of structural variants (insertions, deletions, inversions and inter- and intrachromosomal translocations) from 50-base-pair paired-end reads. The sizes of the detected variants vary from 10 base pairs to 1 megabase pair.

1,418 citations


Journal ArticleDOI
TL;DR: VarScan, an open source tool for variant detection that is compatible with several short read aligners, is presented and shown to detect SNPs and indels with high sensitivity and specificity in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples.
Abstract: Summary: Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. We present VarScan, an open source tool for variant detection that is compatible with several short read aligners. We demonstrate VarScan’s ability to detect SNPs and indels with high sensitivity and specificity, in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples. Availability and Implementation: Source code and documentation freely available at http://genome.wustl.edu/tools/cancer-genomics, implemented as a Perl package and supported on Linux/UNIX, MS Windows and Mac OSX.

1,250 citations


Journal ArticleDOI
TL;DR: A simplified model of the human gut microbiota illustrates niche specialization and functional redundancy within members of its major bacterial phyla, and the importance of host glycans as a nutrient foundation that ensures ecosystem stability.
Abstract: The adult human distal gut microbial community is typically dominated by 2 bacterial phyla (divisions), the Firmicutes and the Bacteroidetes. Little is known about the factors that govern the interactions between their members. Here, we examine the niches of representatives of both phyla in vivo. Finished genome sequences were generated from Eubacterium rectale and E. eligens, which belong to Clostridium Cluster XIVa, one of the most common gut Firmicute clades. Comparison of these and 25 other gut Firmicutes and Bacteroidetes indicated that the Firmicutes possess smaller genomes and a disproportionately smaller number of glycan-degrading enzymes. Germ-free mice were then colonized with E. rectale and/or a prominent human gut Bacteroidetes, Bacteroides thetaiotaomicron, followed by whole-genome transcriptional profiling, high-resolution proteomic analysis, and biochemical assays of microbial-microbial and microbial-host interactions. B. thetaiotaomicron adapts to E. rectale by up-regulating expression of a variety of polysaccharide utilization loci encoding numerous glycoside hydrolases, and by signaling the host to produce mucosal glycans that it, but not E. rectale, can access. E. rectale adapts to B. thetaiotaomicron by decreasing production of its glycan-degrading enzymes, increasing expression of selected amino acid and sugar transporters, and facilitating glycolysis by reducing levels of NADH, in part via generation of butyrate from acetate, which in turn is used by the gut epithelium. This simplified model of the human gut microbiota illustrates niche specialization and functional redundancy within members of its major bacterial phyla, and the importance of host glycans as a nutrient foundation that ensures ecosystem stability.

670 citations


Journal ArticleDOI
TL;DR: The use of an unbiased high-resolution genomic screen identified many genes not previously implicated in AML that may be relevant for pathogenesis, along with many known oncogenes and tumor suppressor genes.
Abstract: Cytogenetic analysis of acute myeloid leukemia (AML) cells has accelerated the identification of genes important for AML pathogenesis. To complement cytogenetic studies and to identify genes altered in AML genomes, we performed genome-wide copy number analysis with paired normal and tumor DNA obtained from 86 adult patients with de novo AML using 1.85 million feature SNP arrays. Acquired copy number alterations (CNAs) were confirmed using an ultra-dense array comparative genomic hybridization platform. A total of 201 somatic CNAs were found in the 86 AML genomes (mean, 2.34 CNAs per genome), with French-American-British system M6 and M7 genomes containing the most changes (10–29 CNAs per genome). Twenty-four percent of AML patients with normal cytogenetics had CNA, whereas 40% of patients with an abnormal karyotype had additional CNA detected by SNP array, and several CNA regions were recurrent. The mRNA expression levels of 57 genes were significantly altered in 27 of 50 recurrent CNA regions <5 megabases in size. A total of 8 uniparental disomy (UPD) segments were identified in the 86 genomes; 6 of 8 UPD calls occurred in samples with a normal karyotype. Collectively, 34 of 86 AML genomes (40%) contained alterations not found with cytogenetics, and 98% of these regions contained genes. Of 86 genomes, 43 (50%) had no CNA or UPD at this level of resolution. In this study of 86 adult AML genomes, the use of an unbiased high-resolution genomic screen identified many genes not previously implicated in AML that may be relevant for pathogenesis, along with many known oncogenes and tumor suppressor genes.

241 citations


Journal ArticleDOI
12 Feb 2009-Nature
TL;DR: The results suggest that the evolutionary properties of copy-number mutation differ significantly from other forms of genetic mutation and, in contrast to the hominid slowdown of single-base-pair mutations, there has been a genomic burst of duplication activity at this period during human evolution.
Abstract: It is generally accepted that the extent of phenotypic change between human and great apes is dissonant with the rate of molecular change. Between these two groups, proteins are virtually identical, cytogenetically there are few rearrangements that distinguish ape-human chromosomes, and rates of single-base-pair change and retrotransposon activity have slowed particularly within hominid lineages when compared to rodents or monkeys. Studies of gene family evolution indicate that gene loss and gain are enriched within the primate lineage. Here, we perform a systematic analysis of duplication content of four primate genomes (macaque, orang-utan, chimpanzee and human) in an effort to understand the pattern and rates of genomic duplication during hominid evolution. We find that the ancestral branch leading to human and African great apes shows the most significant increase in duplication activity both in terms of base pairs and in terms of events. This duplication acceleration within the ancestral species is significant when compared to lineage-specific rate estimates even after accounting for copy-number polymorphism and homoplasy. We discover striking examples of recurrent and independent gene-containing duplications within the gorilla and chimpanzee that are absent in the human lineage. Our results suggest that the evolutionary properties of copy-number mutation differ significantly from other forms of genetic mutation and, in contrast to the hominid slowdown of single-base-pair mutations, there has been a genomic burst of duplication activity at this period during human evolution.

227 citations


Journal ArticleDOI
TL;DR: This study concentrates on five species of Saccharomycetaceae, a large subdivision of hemiascomycetes, that are called "protoploid" because they diverged from the S. cerevisiae lineage prior to its genome duplication.
Abstract: Our knowledge of yeast genomes remains largely dominated by the extensive studies on Saccharomyces cerevisiae and the consequences of its ancestral duplication, leaving the evolution of the entire class of hemiascomycetes only partly explored. We concentrate here on five species of Saccharomycetaceae, a large subdivision of hemiascomycetes, that we call "protoploid" because they diverged from the S. cerevisiae lineage prior to its genome duplication. We determined the complete genome sequences of three of these species: Kluyveromyces (Lachancea) thermotolerans and Saccharomyces (Lachancea) kluyveri (two members of the newly described Lachancea clade), and Zygosaccharomyces rouxii. We included in our comparisons the previously available sequences of Kluyveromyces lactis and Ashbya (Eremothecium) gossypii. Despite their broad evolutionary range and significant individual variations in each lineage, the five protoploid Saccharomycetaceae share a core repertoire of approximately 3300 protein families and a high degree of conserved synteny. Synteny blocks were used to define gene orthology and to infer ancestors. Far from representing minimal genomes without redundancy, the five protoploid yeasts contain numerous copies of paralogous genes, either dispersed or in tandem arrays, that, altogether, constitute a third of each genome. Ancient, conserved paralogs as well as novel, lineage-specific paralogs were identified.

221 citations


Journal ArticleDOI
TL;DR: Several areas within cancer genomics are being transformed by the application of new technology, and in the process are dramatically expanding the authors' understanding of this disease.
Abstract: A genomic era of cancer studies is developing rapidly, fueled by the emergence of next-generation sequencing technologies that provide exquisite sensitivity and resolution. This article discusses several areas within cancer genomics that are being transformed by the application of new technology, and in the process are dramatically expanding our understanding of this disease. Although we anticipate that there will be many exciting discoveries in the near future, the ultimate success of these endeavors rests on our ability to translate what is learned into better diagnosis, treatment and prevention of cancer.

217 citations


Journal ArticleDOI
TL;DR: All available physical, sequence, genetic, and optical data were used to generate a golden path (AGP) of chromosome-based pseudomolecules, herein referred to as the B73 Reference Genome Sequence version 1 (B73 RefGen_v1).
Abstract: Maize is a major cereal crop and an important model system for basic biological research. Knowledge gained from maize research can also be used to genetically improve its grass relatives such as sorghum, wheat, and rice. The primary objective of the Maize Genome Sequencing Consortium (MGSC) was to generate a reference genome sequence that was integrated with both the physical and genetic maps. Using a previously published integrated genetic and physical map, combined with in-coming maize genomic sequence, new sequence-based genetic markers, and an optical map, we dynamically picked a minimum tiling path (MTP) of 16,910 bacterial artificial chromosome (BAC) and fosmid clones that were used by the MGSC to sequence the maize genome. The final MTP resulted in a significantly improved physical map that reduced the number of contigs from 721 to 435, incorporated a total of 8,315 mapped markers, and ordered and oriented the majority of FPC contigs. The new integrated physical and genetic map covered 2,120 Mb (93%) of the 2,300-Mb genome, of which 405 contigs were anchored to the genetic map, totaling 2,103.4 Mb (99.2% of the 2,120 Mb physical map). More importantly, 336 contigs, comprising 94.0% of the physical map (approximately 1,993 Mb), were ordered and oriented. Finally, we used all available physical, sequence, genetic, and optical data to generate a golden path (AGP) of chromosome-based pseudomolecules, herein referred to as the B73 Reference Genome Sequence version 1 (B73 RefGen_v1).
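The idea of a minimum tiling path can be illustrated with the classical greedy interval-cover algorithm. The sketch below is only an illustration under simplifying assumptions and is not the MGSC's procedure: the clone tuples, their coordinates, and the function name `greedy_tiling_path` are hypothetical, clone positions are assumed to be already known on a single linear coordinate, and the required overlap between neighbouring clones is ignored.

```python
from typing import List, Tuple

# Hypothetical clone record: (name, start, end), half-open coordinates.
Clone = Tuple[str, int, int]


def greedy_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Select few clones whose union covers [region_start, region_end).

    Classical greedy interval cover: among the clones that start at or
    before the position covered so far, take the one reaching furthest
    right. Raises ValueError if a gap makes full coverage impossible.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every not-yet-seen clone that starts within the covered part.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path


if __name__ == "__main__":
    demo = [("a", 0, 100), ("b", 50, 180), ("c", 90, 200),
            ("d", 150, 300), ("e", 190, 320)]
    print([name for name, _, _ in greedy_tiling_path(demo, 0, 320)])
```

A real tiling path would additionally enforce a minimum overlap between adjacent clones and would weigh marker and fingerprint evidence, which this sketch leaves out.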

106 citations


Journal ArticleDOI
TL;DR: The results demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets; such improvements, along with those in gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.
Abstract: Most of our understanding of plant genome structure and evolution has come from the careful annotation of small (e.g., 100 kb) sequenced genomic regions or from automated annotation of complete genome sequences. Here, we sequenced and carefully annotated a contiguous 22 Mb region of maize chromosome 4 using an improved pseudomolecule for annotation. The sequence segment was comprehensively ordered, oriented, and confirmed using the maize optical map. Nearly 84% of the sequence is composed of transposable elements (TEs) that are mostly nested within each other, of which most families are low-copy. We identified 544 gene models using multiple levels of evidence, as well as five miRNA genes. Gene fragments, many captured by TEs, are prevalent within this region. Elimination of gene redundancy from a tetraploid maize ancestor that originated a few million years ago is responsible in this region for most disruptions of synteny with sorghum and rice. Consistent with other sub-genomic analyses in maize, small RNA mapping showed that many small RNAs match TEs and that most TEs match small RNAs. These results, performed on approximately 1% of the maize genome, demonstrate the feasibility of refining the B73 RefGen_v1 genome assembly by incorporating optical map, high-resolution genetic map, and comparative genomic data sets. Such improvements, along with those of gene and repeat annotation, will serve to promote future functional genomic and phylogenomic research in maize and other grasses.

Journal ArticleDOI
TL;DR: Analysis of 24 synteny breakpoints in the white-cheeked gibbon provides a model for a replication-dependent repair mechanism for double-strand breaks (DSBs) at rearrangement sites and insights into the structure and formation of primate segmental duplications at sites of genomic rearrangements during evolution.
Abstract: The gibbon genome exhibits extensive karyotypic diversity with an increased rate of chromosomal rearrangements during evolution. In an effort to understand the mechanistic origin and implications of these rearrangement events, we sequenced 24 synteny breakpoint regions in the white-cheeked gibbon (Nomascus leucogenys, NLE) in the form of high-quality BAC insert sequences (4.2 Mbp). While there is a significant deficit of breakpoints in genes, we identified seven human gene structures involved in signaling pathways (DEPDC4, GNG10), phospholipid metabolism (ENPP5, PLSCR2), beta-oxidation (ECH1), cellular structure and transport (HEATR4), and transcription (ZNF461), that have been disrupted in the NLE gibbon lineage. Notably, only three of these genes show the expected evolutionary signatures of pseudogenization. Sequence analysis of the breakpoints suggested both nonclassical nonhomologous end-joining (NHEJ) and replication-based mechanisms of rearrangement. A substantial number (11/24) of human-NLE gibbon breakpoints showed new insertions of gibbon-specific repeats and mosaic structures formed from disparate sequences including segmental duplications, LINE, SINE, and LTR elements. Analysis of these sites provides a model for a replication-dependent repair mechanism for double-strand breaks (DSBs) at rearrangement sites and insights into the structure and formation of primate segmental duplications at sites of genomic rearrangements during evolution.

Journal ArticleDOI
TL;DR: The systematic karyotyping of bone marrow cells was the first genomic approach used to personalize therapy for patients with leukemia; the paradigm it established now has the potential to be rapidly extended with the use of whole-genome sequencing approaches for cancer, which are now possible.
Abstract: The systematic karyotyping of bone marrow cells was the first genomic approach used to personalize therapy for patients with leukemia. The paradigm established by cytogenetic studies in leukemia (from gene discovery to therapeutic intervention) now has the potential to be rapidly extended with the use of whole-genome sequencing approaches for cancer, which are now possible. We are now entering a period of exponential growth in cancer gene discovery that will provide many novel therapeutic targets for a large number of cancer types. Establishing the pathogenetic relevance of individual mutations is a major challenge that must be solved. However, after thousands of cancer genomes have been sequenced, the genetic rules of cancer will become known and new approaches for diagnosis, risk stratification and individualized treatment of cancer patients will surely follow.

Journal ArticleDOI
TL;DR: This large-scale expressed sequence tag (EST) analysis effort enables gene discovery and development of microsatellite markers and will enable genetic mapping and population genetic studies.
Abstract: The entomopathogenic nematode Heterorhabditis bacteriophora and its symbiotic bacterium, Photorhabdus luminescens, are important biological control agents of insect pests. This nematode-bacterium-insect association represents an emerging tripartite model for research on mutualistic and parasitic symbioses. Elucidation of mechanisms underlying these biological processes may serve as a foundation for improving the biological control potential of the nematode-bacterium complex. This large-scale expressed sequence tag (EST) analysis effort enables gene discovery and development of microsatellite markers. These ESTs will also aid in the annotation of the upcoming complete genome sequence of H. bacteriophora. A total of 31,485 high quality ESTs were generated from cDNA libraries of the adult H. bacteriophora TTO1 strain. Cluster analysis revealed the presence of 3,051 contigs and 7,835 singletons, representing 10,886 distinct EST sequences. About 72% of the distinct EST sequences had significant matches (E value < 1e-5) to proteins in GenBank's non-redundant (nr) and Wormpep190 databases. We have identified 12 ESTs corresponding to 8 genes potentially involved in RNA interference, 22 ESTs corresponding to 14 genes potentially involved in dauer-related processes, and 51 ESTs corresponding to 27 genes potentially involved in defense and stress responses. Comparison to ESTs and proteins of free-living nematodes led to the identification of 554 parasitic nematode-specific ESTs in H. bacteriophora, among which are those encoding F-box-like/WD-repeat protein theromacin, Bax inhibitor-1-like protein, and PAZ domain containing protein. Gene Ontology terms were assigned to 6,685 of the 10,886 ESTs. A total of 168 microsatellite loci were identified with primers designable for 141 loci. A total of 10,886 distinct EST sequences were identified from adult H. bacteriophora cDNA libraries. BLAST searches revealed ESTs potentially involved in parasitism, RNA interference, defense responses, stress responses, and dauer-related processes. The putative microsatellite markers identified in H. bacteriophora ESTs will enable genetic mapping and population genetic studies. These genomic resources provide the material base necessary for genome annotation, microarray development, and in-depth gene functional analysis.

Journal ArticleDOI
TL;DR: This study identified and characterized the molecular determinants that help in defining the phylum Nematoda, and therefore improved the understanding of nematode protein evolution and provided novel insights for the development of next generation parasite control strategies.
Abstract: Nematoda diverged from other animals between 600 and 1,200 million years ago and has become one of the most diverse animal phyla on earth. Most nematodes are free-living animals, but many are parasites of plants and animals including humans, posing major ecological and economic challenges around the world. We investigated phylum-specific molecular characteristics in Nematoda by exploring over 214,000 polypeptides from 32 nematode species including 27 parasites. Over 50,000 nematode protein families were identified based on primary sequence, including ~10% with members from at least three different species. Nearly 1,600 of the multi-species families did not share homology to Pfam domains, including a total of 758 restricted to Nematoda. The majority of the 462 families that were conserved among both free-living and parasitic species contained members from multiple nematode clades, yet ~90% of the 296 parasite-specific families originated only from a single clade. Features of these protein families were revealed through extrapolation of essential functions from observed RNAi phenotypes in C. elegans, bioinformatics-based functional annotations, identification of distant homology based on protein folds, and prediction of expression at accessible nematode surfaces. In addition, we identified a group of nematode-restricted sequence features in energy-generating electron transfer complexes as potential targets for new chemicals with minimal or no toxicity to the host. This study identified and characterized the molecular determinants that help in defining the phylum Nematoda, and therefore improved our understanding of nematode protein evolution and provided novel insights for the development of next generation parasite control strategies.

Journal ArticleDOI
TL;DR: Optimal project-wide redundancy and sample size are shown to be inversely proportional to the desired variant frequency, and optimization principles reported here dramatically simplify the design process and should be broadly useful as rare-variant projects become both more important and routine in the future.
Abstract: Rare population variants are known to have important biomedical implications, but their systematic discovery has only recently been enabled by advances in DNA sequencing. The design process of a discovery project remains formidable, being limited to ad hoc mixtures of extensive computer simulation and pilot sequencing. Here, the task is examined from a general mathematical perspective. We pose and solve the population sequencing design problem and subsequently apply standard optimization techniques that maximize the discovery probability. Emphasis is placed on cases whose discovery thresholds place them within reach of current technologies. We find that parameter values characteristic of rare-variant projects lead to a general, yet remarkably simple set of optimization rules. Specifically, optimal processing occurs at constant values of the per-sample redundancy, refuting current notions that sample size should be selected outright. Optimal project-wide redundancy and sample size are then shown to be inversely proportional to the desired variant frequency. A second family of constants governs these relationships, permitting one to immediately establish the most efficient settings for a given set of discovery conditions. Our results largely concur with the empirical design of the Thousand Genomes Project, though they furnish some additional refinement. The optimization principles reported here dramatically simplify the design process and should be broadly useful as rare-variant projects become both more important and routine in the future.
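The inverse relationship between sample size and variant frequency can be seen even in a much simpler model than the one the abstract describes. The sketch below is not the paper's optimization: it assumes diploid individuals, independent chromosomes, and perfect detection of any variant present in the sample (sequencing redundancy and errors are ignored), and the function names are hypothetical.

```python
import math


def prob_variant_sampled(freq: float, n_individuals: int) -> float:
    """Probability that at least one of the 2n sampled chromosomes carries
    a variant of population frequency `freq` (simplified sampling model)."""
    return 1.0 - (1.0 - freq) ** (2 * n_individuals)


def min_sample_size(freq: float, target_prob: float) -> int:
    """Smallest number of diploid individuals n such that
    prob_variant_sampled(freq, n) >= target_prob."""
    n = math.log(1.0 - target_prob) / (2.0 * math.log(1.0 - freq))
    return math.ceil(n)


if __name__ == "__main__":
    for f in (0.01, 0.005, 0.001):
        n = min_sample_size(f, 0.95)
        print(f"freq={f}: n={n}, n*freq={n * f:.2f}")
```

For small frequencies, log(1 - freq) is close to -freq, so the required n is approximately -ln(1 - target_prob) / (2 * freq): the product n * freq stays roughly constant, which is the inverse proportionality in its simplest form.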

Journal ArticleDOI
TL;DR: This study presents the first large-scale genomic survey of O. ostertagi by the analysis of expressed transcripts from three stages of the parasite: third-stage larvae, fourth-stage larvae and adult worms, and identifies transcripts that can facilitate the design of control strategies and vaccine programs.

Journal ArticleDOI
TL;DR: The statistical theory characterizing the length-discrepancy scheme for Gaussian libraries resolves several outstanding issues and furnishes a general methodology for designing future projects from the standpoint of a spectrum-wide constant risk.
Abstract: Structural variations in the form of DNA insertions and deletions are an important aspect of human genetics and especially relevant to medical disorders. Investigations have shown that such events can be detected via tell-tale discrepancies in the aligned lengths of paired-end DNA sequencing reads. Quantitative aspects underlying this method remain poorly understood, despite its importance and conceptual simplicity. We report the statistical theory characterizing the length-discrepancy scheme for Gaussian libraries, including coverage-related effects that preceding models are unable to account for. Deletion and insertion statistics both depend heavily on physical coverage, but otherwise differ dramatically, refuting a commonly held doctrine of symmetry. Specifically, coverage restrictions render insertions much more difficult to capture. Increased read length has the counterintuitive effect of worsening insertion detection characteristics of short inserts. Variance in library insert length is also a critical factor here and should be minimized to the greatest degree possible. Conversely, no significant improvement would be realized in lowering fosmid variances beyond current levels. Detection power is examined under a straightforward alternative hypothesis and found to be generally acceptable. We also consider the proposition of characterizing variation over the entire spectrum of variant sizes under constant risk of false-positive errors. At 1% risk, many designs will leave a significant gap in the 100 to 200 bp neighborhood, requiring unacceptably high redundancies to compensate. We show that a few modifications largely close this gap and we give a few examples of feasible spectrum-covering designs. The theory resolves several outstanding issues and furnishes a general methodology for designing future projects from the standpoint of a spectrum-wide constant risk.
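The length-discrepancy idea can be sketched for a single read pair as follows. This is a minimal illustration, not the statistical theory of the paper: it assumes a Gaussian library with known mean and standard deviation, applies a plain z-score cutoff to one pair at a time, and ignores the coverage-related effects the abstract emphasizes. The function name `classify_pair` and the default cutoff are hypothetical.

```python
from typing import Optional


def classify_pair(observed_span: float, lib_mean: float, lib_sd: float,
                  z_cut: float = 3.0) -> Optional[str]:
    """Flag a read pair whose mapped span deviates from the library insert size.

    A span much longer than expected suggests a deletion in the sample
    (the reads map further apart on the reference); a span much shorter
    than expected suggests an insertion. Returns None if the pair is
    consistent with the library distribution.
    """
    z = (observed_span - lib_mean) / lib_sd
    if z > z_cut:
        return "deletion"
    if z < -z_cut:
        return "insertion"
    return None


if __name__ == "__main__":
    # Hypothetical library: mean insert 200 bp, standard deviation 20 bp.
    for span in (210, 400, 100):
        print(span, classify_pair(span, lib_mean=200, lib_sd=20))
```

The cutoff is expressed in units of the library standard deviation, so a smaller library variance lets smaller events cross it, in line with the abstract's point that insert-length variance should be minimized. Practical callers typically require several supporting pairs at a locus rather than judging one pair in isolation.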

Journal Article
TL;DR: In this article, the authors explore variation on an intermediate scale, particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs, and find that 50% of the structural variants were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant.
Abstract: Genetic variation among individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single nucleotide changes. Here we explore variation on an intermediate scale, particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1,695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number between individuals. Complete sequencing of 261 structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation, a standard for genotyping platforms and a prelude to future individual genome sequencing projects.

Journal ArticleDOI
20 Nov 2009-Blood
TL;DR: The results suggest that PML-RARA has an extended repertoire of genomic DNA binding sites compared to wild-type RARA, reflecting novel gain-of-function properties of the fusion protein.