
Showing papers on "Genomics" published in 2009


Journal ArticleDOI
09 Apr 2009-Nature
TL;DR: The authors argue that obtaining the complete DNA sequence of large numbers of cancer genomes is becoming possible and will provide a detailed and comprehensive perspective on how individual cancers have developed.
Abstract: All cancers arise as a result of changes that have occurred in the DNA sequence of the genomes of cancer cells. Over the past quarter of a century much has been learnt about these mutations and the abnormal genes that operate in human cancers. We are now, however, moving into an era in which it will be possible to obtain the complete DNA sequence of large numbers of cancer genomes. These studies will provide us with a detailed and comprehensive perspective on how individual cancers have developed.

3,156 citations


Journal ArticleDOI
10 Sep 2009-Nature
TL;DR: It is shown that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals, and may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of non-synonymous variants by predicted functional impact.
Abstract: Genome-wide association studies suggest that common genetic variants explain only a modest fraction of heritable risk for common diseases, raising the question of whether rare variants account for a significant fraction of unexplained heritability. Although DNA sequencing costs have fallen markedly, they remain far from what is necessary for rare and novel variants to be routinely identified at a genome-wide scale in large cohorts. We have therefore sought to develop second-generation methods for targeted sequencing of all protein-coding regions ('exomes'), to reduce costs while enriching for discovery of highly penetrant variants. Here we report on the targeted capture and massively parallel sequencing of the exomes of 12 humans. These include eight HapMap individuals representing three populations, and four unrelated individuals with a rare dominantly inherited disorder, Freeman-Sheldon syndrome (FSS). We demonstrate the sensitive and specific identification of rare and common variants in over 300 megabases of coding sequence. Using FSS as a proof-of-concept, we show that candidate genes for Mendelian disorders can be identified by exome sequencing of a small number of unrelated, affected individuals. This strategy may be extendable to diseases with more complex genetics through larger sample sizes and appropriate weighting of non-synonymous variants by predicted functional impact.

1,846 citations


Journal ArticleDOI
19 Mar 2009-Nature
TL;DR: Rather than one or two domestication events leading to the extant baker’s yeasts, the population structure of S. cerevisiae consists of a few well-defined, geographically isolated lineages and many different mosaics of these lineages, supporting the idea that human influence provided the opportunity for cross-breeding and production of new combinations of pre-existing variations.
Abstract: Since the completion of the genome sequence of Saccharomyces cerevisiae in 1996 (refs 1, 2), there has been a large increase in complete genome sequences, accompanied by great advances in our understanding of genome evolution. Although little is known about the natural and life histories of yeasts in the wild, there are an increasing number of studies looking at ecological and geographic distributions, population structure and sexual versus asexual reproduction. Less well understood at the whole genome level are the evolutionary processes acting within populations and species that lead to adaptation to different environments, phenotypic differences and reproductive isolation. Here we present one- to fourfold or more coverage of the genome sequences of over seventy isolates of the baker's yeast S. cerevisiae and its closest relative, Saccharomyces paradoxus. We examine variation in gene content, single nucleotide polymorphisms, nucleotide insertions and deletions, copy numbers and transposable elements. We find that phenotypic variation broadly correlates with global genome-wide phylogenetic relationships. S. paradoxus populations are well delineated along geographic boundaries, whereas the variation among worldwide S. cerevisiae isolates shows less differentiation and is comparable to a single S. paradoxus population. Rather than one or two domestication events leading to the extant baker's yeasts, the population structure of S. cerevisiae consists of a few well-defined, geographically isolated lineages and many different mosaics of these lineages, supporting the idea that human influence provided the opportunity for cross-breeding and production of new combinations of pre-existing variations.

1,425 citations


Journal ArticleDOI
TL;DR: Development of commercial sequencing devices is reviewed, some European contributions to the field are mentioned, and presently commercially available very high-throughput DNA sequencing platforms, as well as techniques under development, are described and their applications in bio-medical fields discussed.

979 citations


Journal ArticleDOI
24 Dec 2009-Nature
TL;DR: The results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.
Abstract: Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms. There are now nearly 1,000 completed bacterial and archaeal genomes available, most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic phylogenomic efforts to compile a phylogeny-driven Genomic Encyclopedia of Bacteria and Archaea in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come. © 2009 Macmillan Publishers Limited. All rights reserved.

928 citations


Journal ArticleDOI
05 Mar 2009-Nature
TL;DR: Integrative analysis of high-throughput long- and short-read transcriptome sequencing of cancer cells establishes a robust pipeline for discovering novel gene fusions that result in chimaeric transcripts in cancer cell lines and tumours.
Abstract: Recurrent gene fusions, typically associated with haematological malignancies and rare bone and soft-tissue tumours, have recently been described in common solid tumours. Here we use an integrative analysis of high-throughput long- and short-read transcriptome sequencing of cancer cells to discover novel gene fusions. As a proof of concept, we successfully used integrative transcriptome sequencing to 're-discover' the BCR-ABL1 (ref. 10) gene fusion in a chronic myelogenous leukaemia cell line and the TMPRSS2-ERG gene fusion in a prostate cancer cell line and tissues. Additionally, we nominated, and experimentally validated, novel gene fusions resulting in chimaeric transcripts in cancer cell lines and tumours. Taken together, this study establishes a robust pipeline for the discovery of novel gene chimaeras using high-throughput sequencing, opening up an important class of cancer-related mutations for comprehensive characterization.

923 citations


Journal ArticleDOI
TL;DR: In the relatively short time frame since 2005, NGS has fundamentally altered genomics research and allowed investigators to conduct experiments that were previously not technically feasible or affordable, and further improvements in technology robustness and process streamlining will pave the path for translation into clinical diagnostics.
Abstract: Background: For the past 30 years, the Sanger method has been the dominant approach and gold standard for DNA sequencing. The commercial launch of the first massively parallel pyrosequencing platform in 2005 ushered in the new era of high-throughput genomic analysis now referred to as next-generation sequencing (NGS). Content: This review describes fundamental principles of commercially available NGS platforms. Although the platforms differ in their engineering configurations and sequencing chemistries, they share a technical paradigm in that sequencing of spatially separated, clonally amplified DNA templates or single DNA molecules is performed in a flow cell in a massively parallel manner. Through iterative cycles of polymerase-mediated nucleotide extensions or, in one approach, through successive oligonucleotide ligations, sequence outputs in the range of hundreds of megabases to gigabases are now obtained routinely. Highlighted in this review are the impact of NGS on basic research, bioinformatics considerations, and translation of this technology into clinical diagnostics. Also presented is a view into future technologies, including real-time single-molecule DNA sequencing and nanopore-based sequencing. Summary: In the relatively short time frame since 2005, NGS has fundamentally altered genomics research and allowed investigators to conduct experiments that were previously not technically feasible or affordable. The various technologies that constitute this new paradigm continue to evolve, and further improvements in technology robustness and process streamlining will pave the path for translation into clinical diagnostics.

906 citations


Journal ArticleDOI
TL;DR: The Genomic tRNA Database (GtRNAdb), currently including over 74 000 tRNA genes predicted from 740 species, provides overview statistics of tRNA genes within each analyzed genome, including information by isotype and genetic locus, easily downloadable primary sequences, graphical secondary structures and multiple sequence alignments.
Abstract: Transfer RNAs (tRNAs) represent the single largest, best-understood class of non-protein coding RNA genes found in all living organisms. By far, the major source of new tRNAs is computational identification of genes within newly sequenced genomes. To organize the rapidly growing collection and enable systematic analyses, we created the Genomic tRNA Database (GtRNAdb), currently including over 74 000 tRNA genes predicted from 740 species. The web resource provides overview statistics of tRNA genes within each analyzed genome, including information by isotype and genetic locus, easily downloadable primary sequences, graphical secondary structures and multiple sequence alignments. Direct links for each gene to UCSC eukaryotic and microbial genome browsers provide graphical display of tRNA genes in the context of all other local genetic information. The database can be searched by primary sequence similarity, tRNA characteristics or phylogenetic group. The database is publicly available at http://gtrnadb.ucsc.edu.

851 citations


Journal ArticleDOI
TL;DR: This review outlines some important areas such as the large-scale development of molecular markers for linkage mapping, association mapping, wide crosses and alien introgression, epigenetic modifications, transcript profiling, population genetics and de novo genome/organellar genome assembly for which these technologies are expected to advance crop genetics and breeding, leading to crop improvement.

822 citations


Journal ArticleDOI
TL;DR: A high-throughput genome-based method for genotyping recombinant populations utilizing whole-genome resequencing data generated by the Illumina Genome Analyzer is developed and located a quantitative trait locus of large effect on plant height in a 100-kb region containing the rice "green revolution" gene.
Abstract: The next-generation sequencing technology coupled with the growing number of genome sequences opens the opportunity to redesign genotyping strategies for more effective genetic mapping and genome analysis. We have developed a high-throughput method for genotyping recombinant populations utilizing whole-genome resequencing data generated by the Illumina Genome Analyzer. A sliding window approach is designed to collectively examine genome-wide single nucleotide polymorphisms for genotype calling and recombination breakpoint determination. Using this method, we constructed a genetic map for 150 rice recombinant inbred lines with an expected genotype calling accuracy of 99.94% and a resolution of recombination breakpoints within an average of 40 kb. In comparison to the genetic map constructed with 287 PCR-based markers for the rice population, the sequencing-based method was approximately 20x faster in data collection and 35x more precise in recombination breakpoint determination. Using the sequencing-based genetic map, we located a quantitative trait locus of large effect on plant height in a 100-kb region containing the rice "green revolution" gene. Through computer simulation, we demonstrate that the method is robust for different types of mapping populations derived from organisms with variable quality of genome sequences and is feasible for organisms with large genome sizes and low polymorphisms. With continuous advances in sequencing technologies, this genome-based method may replace the conventional marker-based genotyping approach to provide a powerful tool for large-scale gene discovery and for addressing a wide range of biological questions.
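
The sliding-window idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the window size, the majority-vote rule, and the function names `window_calls` and `breakpoints` are assumptions made for the example, and each SNP is assumed to be already labelled with its parent of origin ("A" or "B").

```python
from collections import Counter


def window_calls(snp_alleles, window=5):
    """Call one parental genotype per window position by majority vote.

    snp_alleles: parental labels ("A" or "B") for consecutive SNPs on one
    chromosome of one recombinant line. An odd window size avoids ties
    when only two labels are present (illustrative choice, not from the paper).
    """
    calls = []
    for start in range(len(snp_alleles) - window + 1):
        counts = Counter(snp_alleles[start:start + window])
        calls.append(counts.most_common(1)[0][0])
    return calls


def breakpoints(calls):
    """Return the window indices at which the called genotype changes."""
    return [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]


if __name__ == "__main__":
    # Two isolated miscalls (a "B" among "A"s and an "A" among "B"s) are
    # absorbed by the window; only the real switch is reported.
    snps = list("AAAAABAAAABBBBBABBBB")
    print(breakpoints(window_calls(snps, window=5)))
```

Examining SNPs collectively in a window, rather than one at a time, is what lets isolated sequencing errors be tolerated; the trade-off is that the breakpoint is located only to within roughly one window.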

773 citations


Journal ArticleDOI
18 Jun 2009-Nature
TL;DR: Little is known about how genetic information produces complex organisms; a look at the crucial functional elements of fly and worm genomes could change that.
Abstract: Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that.

Journal ArticleDOI
09 Sep 2009-Nature
TL;DR: Equipped with the tools emerging from the genomics revolution, scientists are now in a position to link molecular states to physiological ones through the reverse engineering of molecular networks that sense DNA and environmental perturbations and, as a result, drive variations in physiological states associated with disease.
Abstract: The molecular biology revolution led to an intense focus on the study of interactions between DNA, RNA and protein biosynthesis in order to develop a more comprehensive understanding of the cell. One consequence of this focus was a reduced attention to whole-system physiology, making it difficult to link molecular biology to clinical medicine. Equipped with the tools emerging from the genomics revolution, we are now in a position to link molecular states to physiological ones through the reverse engineering of molecular networks that sense DNA and environmental perturbations and, as a result, drive variations in physiological states associated with disease.

Journal Article
TL;DR: In this issue, modENCODE team members outline their plan of campaign; data from the project are to be made available at http://www.modencode.org and elsewhere as the work progresses.
Abstract: Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that.

Journal ArticleDOI
TL;DR: Genomic islands play a crucial role in the evolution of a broad spectrum of bacteria as they are involved in the dissemination of variable genes, including antibiotic resistance and virulence genes leading to generation of hospital ‘superbugs’, as well as catabolic genes leading to formation of new metabolic pathways.
Abstract: Bacterial genomes evolve through mutations, rearrangements or horizontal gene transfer. Besides the core genes encoding essential metabolic functions, bacterial genomes also harbour a number of accessory genes acquired by horizontal gene transfer that might be beneficial under certain environmental conditions. The horizontal gene transfer contributes to the diversification and adaptation of microorganisms, thus having an impact on the genome plasticity. A significant part of the horizontal gene transfer is or has been facilitated by genomic islands (GEIs). GEIs are discrete DNA segments, some of which are mobile and others which are not, or are no longer mobile, which differ among closely related strains. A number of GEIs are capable of integration into the chromosome of the host, excision, and transfer to a new host by transformation, conjugation or transduction. GEIs play a crucial role in the evolution of a broad spectrum of bacteria as they are involved in the dissemination of variable genes, including antibiotic resistance and virulence genes leading to generation of hospital 'superbugs', as well as catabolic genes leading to formation of new metabolic pathways. Depending on the composition of gene modules, the same type of GEIs can promote survival of pathogenic as well as environmental bacteria.

Journal ArticleDOI
TL;DR: Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual.
Abstract: We describe the genome sequencing of an anonymous individual of African origin using a novel ligation-based sequencing assay that enables a unique form of error correction that improves the raw accuracy of the aligned reads to >99.9%, allowing us to accurately call SNPs with as few as two reads per allele. We collected several billion mate-paired reads yielding approximately 18x haploid coverage of aligned sequence and close to 300x clone coverage. Over 98% of the reference genome is covered with at least one uniquely placed read, and 99.65% is spanned by at least one uniquely placed mate-paired clone. We identify over 3.8 million SNPs, 19% of which are novel. Mate-paired data are used to physically resolve haplotype phases of nearly two-thirds of the genotypes obtained and produce phased segments of up to 215 kb. We detect 226,529 intra-read indels, 5590 indels between mate-paired reads, 91 inversions, and four gene fusions. We use a novel approach for detecting indels between mate-paired reads that are smaller than the standard deviation of the insert size of the library and discover deletions in common with those detected with our intra-read approach. Dozens of mutations previously described in OMIM and hundreds of nonsynonymous single-nucleotide and structural variants in genes previously implicated in disease are identified in this individual. There is more genetic variation in the human genome still to be uncovered, and we provide guidance for future surveys in populations and cancer biopsies.

Journal ArticleDOI
TL;DR: The authors report harnessing next-generation sequencing to identify additional genes, described in the report, and state that common genetic predisposing factors for PAH can be identified by genome-wide association studies.

Journal ArticleDOI
TL;DR: Both the approach of 16S rDNA amplicon sequencing and the whole-genome sequencing approach may be useful for human metagenomics, and numerous bioinformatics tools are being deployed to tackle such vast amounts of microbiological sequence diversity.
Abstract: Background: The Human Microbiome Project has ushered in a new era for human metagenomics and high-throughput next-generation sequencing strategies. Content: This review describes evolving strategies in metagenomics, with a special emphasis on the core technology of DNA pyrosequencing. The challenges of microbial identification in the context of microbial populations are discussed. The development of next-generation pyrosequencing strategies and the technical hurdles confronting these methodologies are addressed. Bioinformatics-related topics include taxonomic systems, sequence databases, sequence-alignment tools, and classifiers. DNA sequencing based on 16S rRNA genes or entire genomes is summarized with respect to potential pyrosequencing applications. Summary: Both the approach of 16S rDNA amplicon sequencing and the whole-genome sequencing approach may be useful for human metagenomics, and numerous bioinformatics tools are being deployed to tackle such vast amounts of microbiological sequence diversity. Metagenomics, or genetic studies of microbial communities, may ultimately contribute to a more comprehensive understanding of human health, disease susceptibilities, and the pathophysiology of infectious and immune-mediated diseases.

Journal ArticleDOI
TL;DR: The UCSC Genome Browser Database (GBD) is a publicly available collection of genome assembly sequence data and integrated annotations for a large number of organisms, including extensive comparative-genomic resources.
Abstract: The UCSC Genome Browser Database (GBD, http://genome.ucsc.edu) is a publicly available collection of genome assembly sequence data and integrated annotations for a large number of organisms, including extensive comparative-genomic resources. In the past year, 13 new genome assemblies have been added, including two important primate species, orangutan and marmoset, bringing the total to 46 assemblies for 24 different vertebrates and 39 assemblies for 22 different invertebrate animals. The GBD datasets may be viewed graphically with the UCSC Genome Browser, which uses a coordinate-based display system allowing users to juxtapose a wide variety of data. These data include all mRNAs from GenBank mapped to all organisms, RefSeq alignments, gene predictions, regulatory elements, gene expression data, repeats, SNPs and other variation data, as well as pairwise and multiple-genome alignments. A variety of other bioinformatics tools are also provided, including BLAT, the Table Browser, the Gene Sorter, the Proteome Browser, VisiGene and Genome Graphs.

Journal ArticleDOI
TL;DR: It is found that shared environmental pressures and interactions among coevolving organisms do not obscure genome signatures in acid mine drainage communities and genome signatures can be used to assign sequence fragments to populations, an essential prerequisite if metagenomics is to provide ecological and biochemical insights into the functioning of microbial communities.
Abstract: Background: Analyses of DNA sequences from cultivated microorganisms have revealed genome-wide, taxa-specific nucleotide compositional characteristics, referred to as genome signatures. These signatures have far-reaching implications for understanding genome evolution and potential application in classification of metagenomic sequence fragments. However, little is known regarding the distribution of genome signatures in natural microbial communities or the extent to which environmental factors shape them. Results: We analyzed metagenomic sequence data from two acidophilic biofilm communities, including composite genomes reconstructed for nine archaea, three bacteria, and numerous associated viruses, as well as thousands of unassigned fragments from strain variants and low-abundance organisms. Genome signatures, in the form of tetranucleotide frequencies analyzed by emergent self-organizing maps, segregated sequences from all known populations sharing < 50 to 60% average amino acid identity and revealed previously unknown genomic clusters corresponding to low-abundance organisms and a putative plasmid. Signatures were pervasive genome-wide. Clusters were resolved because intra-genome differences resulting from translational selection or protein adaptation to the intracellular (pH ~5) versus extracellular (pH ~1) environment were small relative to inter-genome differences. We found that these genome signatures stem from multiple influences but are primarily manifested through codon composition, which we propose is the result of genome-specific mutational biases. Conclusions: An important conclusion is that shared environmental pressures and interactions among coevolving organisms do not obscure genome signatures in acid mine drainage communities. Thus, genome signatures can be used to assign sequence fragments to populations, an essential prerequisite if metagenomics is to provide ecological and biochemical insights into the functioning of microbial communities.
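
For readers unfamiliar with genome signatures, the tetranucleotide-frequency vector mentioned in the abstract above can be computed as follows. This is a minimal sketch: the function name `tetranucleotide_frequencies` is an assumption for the example, reverse-complement pooling (which some analyses apply) is deliberately omitted, and the clustering step with emergent self-organizing maps is not shown.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(seq):
    """Return the relative frequency of each of the 256 possible 4-mers.

    Overlapping windows are counted; windows containing characters other
    than A, C, G or T (for example N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}


if __name__ == "__main__":
    freqs = tetranucleotide_frequencies("ACGTACGT")
    # Five overlapping windows: ACGT, CGTA, GTAC, TACG, ACGT
    print(freqs["ACGT"], freqs["CGTA"])
```

Each sequence fragment becomes a 256-dimensional vector, and fragments from the same genome are expected to lie close together in that space, which is what makes assignment of fragments to populations possible.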

Journal ArticleDOI
TL;DR: A genome-wide siRNA screen was employed to identify additional genes involved in genome stabilization by monitoring phosphorylation of the histone variant H2AX, and data indicate that preservation of genome stability is mediated by a larger network of biological processes than previously appreciated.

Journal ArticleDOI
TL;DR: The methods described here for deep sequencing of the transcriptome should be widely applicable for generating catalogs of genes and genetic markers in emerging model organisms, facilitating genomics studies in corals and other non-model systems.
Abstract: New methods are needed for genomic-scale analysis of emerging model organisms that exemplify important biological questions but lack fully sequenced genomes. For example, there is an urgent need to understand the potential for corals to adapt to climate change, but few molecular resources are available for studying these processes in reef-building corals. To facilitate genomics studies in corals and other non-model systems, we describe methods for transcriptome sequencing using 454, as well as strategies for assembling a useful catalog of genes from the output. We have applied these methods to sequence the transcriptome of planulae larvae from the coral Acropora millepora. More than 600,000 reads produced in a single 454 sequencing run were assembled into ~40,000 contigs with five-fold average sequencing coverage. Based on sequence similarity with known proteins, these analyses identified ~11,000 different genes expressed in a range of conditions including thermal stress and settlement induction. Assembled sequences were annotated with gene names, conserved domains, and Gene Ontology terms. Targeted searches using these annotations identified the majority of genes associated with essential metabolic pathways and conserved signaling pathways, as well as novel candidate genes for stress-related processes. Comparisons with the genome of the anemone Nematostella vectensis revealed ~8,500 pairs of orthologs and ~100 candidate coral-specific genes. More than 30,000 SNPs were detected in the coral sequences, and a subset of these validated by re-sequencing. The methods described here for deep sequencing of the transcriptome should be widely applicable to generate catalogs of genes and genetic markers in emerging model organisms. Our data provide the most comprehensive sequence resource currently available for reef-building corals, and include an extensive collection of potential genetic markers for association and population connectivity studies. The characterization of the larval transcriptome for this widely-studied coral will enable research into the biological processes underlying stress responses in corals and evolutionary adaptation to global climate change.

Journal ArticleDOI
TL;DR: The core of T6SS is composed of 13 proteins, conserved in both pathogenic and non-pathogenic bacteria, and subclasses differ in regulatory and accessory protein content, suggesting that T6SS has evolved to adapt to various microenvironments and specialized functions.
Abstract: The availability of hundreds of bacterial genomes allowed a comparative genomic study of the Type VI Secretion System (T6SS), recently discovered as being involved in pathogenesis. By combining comparative and phylogenetic approaches using more than 500 prokaryotic genomes, we characterized the global T6SS genetic structure in terms of conservation, evolution and genomic organization. This genome-wide analysis allowed the identification of a set of 13 proteins constituting the T6SS protein core and a set of conserved accessory proteins. 176 T6SS loci (encompassing 92 different bacteria) were identified and their comparison revealed that T6SS-encoded genes have a specific conserved genetic organization. Phylogenetic reconstruction based on the core genes showed that lateral transfer of the T6SS is probably its major way of dissemination among pathogenic and non-pathogenic bacteria. Furthermore, the sequence analysis of the VgrG proteins, proposed to be exported in a T6SS-dependent way, confirmed that some C-terminal regions possess domains showing similarities with adhesins or proteins with enzymatic functions. The core of T6SS is composed of 13 proteins, conserved in both pathogenic and non-pathogenic bacteria. Subclasses of T6SS differ in regulatory and accessory protein content, suggesting that T6SS has evolved to adapt to various microenvironments and specialized functions. Based on these results, new functional hypotheses concerning the assembly and function of T6SS proteins are proposed.

Journal ArticleDOI
08 Oct 2009-Nature
TL;DR: This study provides a path to high-throughput and low-cost direct RNA sequencing and achieving the ultimate goal of a comprehensive and bias-free understanding of transcriptomes.
Abstract: Our understanding of human biology and disease is ultimately dependent on a complete understanding of the genome and its functions. The recent application of microarray and sequencing technologies to transcriptomics has changed the simplistic view of transcriptomes to a more complicated view of genome-wide transcription where a large fraction of transcripts emanates from unannotated parts of genomes, and underlined our limited knowledge of the dynamic state of transcription. Most of this broad body of knowledge was obtained indirectly because current transcriptome analysis methods typically require RNA to be converted to complementary DNA (cDNA) before measurements, even though the cDNA synthesis step introduces multiple biases and artefacts that interfere with both the proper characterization and quantification of transcripts. Furthermore, cDNA synthesis is not particularly suitable for the analysis of short, degraded and/or small quantity RNA samples. Here we report direct single molecule RNA sequencing without prior conversion of RNA to cDNA. We applied this technology to sequence femtomole quantities of poly(A)(+) Saccharomyces cerevisiae RNA using a surface coated with poly(dT) oligonucleotides to capture the RNAs at their natural poly(A) tails and initiate sequencing by synthesis. We observed transcript 3' end heterogeneity and polyadenylated small nucleolar RNAs. This study provides a path to high-throughput and low-cost direct RNA sequencing and achieving the ultimate goal of a comprehensive and bias-free understanding of transcriptomes.

Journal ArticleDOI
TL;DR: From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, it is found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
Abstract: Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs), that are extremely highly conserved and which we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
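
The N50 length that the abstract above refers to is a standard assembly statistic: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch follows; the function name `n50` is chosen for the example.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly defines the N50.
        if running >= half_total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

N50 describes contiguity only; it says nothing about whether genes are present in the assembly, which is why a gene-content measure such as the proportion of CEGs mapped is described as complementary.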

Journal ArticleDOI
TL;DR: SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes and from large sequence collections such as UniProt; recent extensions to the database include InterPro abstracts and Gene Ontology terms for superfamilies.
Abstract: SUPERFAMILY provides structural, functional and evolutionary information for proteins from all completely sequenced genomes, and large sequence collections such as UniProt. Protein domain assignments for over 900 genomes are included in the database, which can be accessed at http://supfam.org/. Hidden Markov models based on Structural Classification of Proteins (SCOP) domain definitions at the superfamily level are used to provide structural annotation. We recently produced a new model library based on SCOP 1.73. Family level assignments are also available. From the web site users can submit sequences for SCOP domain classification; search for keywords such as superfamilies, families, organism names, models and sequence identifiers; find over- and underrepresented families or superfamilies within a genome relative to other genomes or groups of genomes; compare domain architectures across selections of genomes and finally build multiple sequence alignments between Protein Data Bank (PDB), genomic and custom sequences. Recent extensions to the database include InterPro abstracts and Gene Ontology terms for superfamilies, taxonomic visualization of the distribution of families across the tree of life, searches for functionally similar domain architectures and phylogenetic trees. The database, models and associated scripts are available for download from the ftp site.

Journal ArticleDOI
TL;DR: The paired-end tag (PET) sequencing strategy, in which short paired tags are extracted from the ends of long DNA fragments for ultra-high-throughput sequencing, is a versatile and powerful approach to DNA analysis and has a bright future ahead.
Abstract: Comprehensive understanding of functional elements in the human genome will require thorough interrogation and comparison of individual human genomes and genomic structures. Such an endeavor will require improvements in the throughputs and costs of DNA sequencing. Next-generation sequencing platforms have impressively low costs and high throughputs but are limited by short read lengths. An immediate and widely recognized solution to this critical limitation is the paired-end tag (PET) sequencing for various applications, collectively called the PET sequencing strategy, in which short and paired tags are extracted from the ends of long DNA fragments for ultra-high-throughput sequencing. The PET sequences can be accurately mapped to the reference genome, thus demarcating the genomic boundaries of PET-represented DNA fragments and revealing the identities of the target DNA elements. PET protocols have been developed for the analyses of transcriptomes, transcription factor binding sites, epigenetic sites such as histone modification sites, and genome structures. The exclusive advantage of the PET technology is its ability to uncover linkages between the two ends of DNA fragments. Using this unique feature, unconventional fusion transcripts, genome structural variations, and even molecular interactions between distant genomic elements can be unraveled by PET analysis. Extensive use of PET data could lead to efficient assembly of individual human genomes, transcriptomes, and interactomes, enabling new biological and clinical insights. With its versatile and powerful nature for DNA analysis, the PET sequencing strategy has a bright future ahead.

Journal ArticleDOI
TL;DR: EDGAR provides novel analysis features and significantly simplifies the comparative analysis of related genomes; it supports a quick survey of evolutionary relationships and simplifies the process of obtaining new biological insights into the differential gene content of kindred genomes.
Abstract: The introduction of next generation sequencing approaches has caused a rapid increase in the number of completely sequenced genomes. As one result of this development, it is now feasible to analyze large groups of related genomes in a comparative approach. A main task in comparative genomics is the identification of orthologous genes in different genomes and the classification of genes as core genes or singletons. To support these studies EDGAR – "Efficient Database framework for comparative Genome Analyses using BLAST score Ratios" – was developed. EDGAR is designed to automatically perform genome comparisons in a high throughput approach. Comparative analyses for 582 genomes across 75 genus groups taken from the NCBI genomes database were conducted with the software and the results were integrated into an underlying database. To demonstrate a specific application case, we analyzed ten genomes of the bacterial genus Xanthomonas, for which phylogenetic studies were awkward due to divergent taxonomic systems. The resultant phylogeny EDGAR provided was consistent with outcomes from traditional approaches performed recently and moreover, it was possible to root each strain with unprecedented accuracy. EDGAR provides novel analysis features and significantly simplifies the comparative analysis of related genomes. The software supports a quick survey of evolutionary relationships and simplifies the process of obtaining new biological insights into the differential gene content of kindred genomes. Visualization features, like synteny plots or Venn diagrams, are offered to the scientific community through a web-based and therefore platform independent user interface at http://edgar.cebitec.uni-bielefeld.de, where the precomputed data sets can be browsed.
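The abstract names BLAST score ratios and the core gene versus singleton classification without giving details, so the following Python sketch is only an illustration under stated assumptions and is not EDGAR's implementation. It assumes a score ratio is the BLAST bit score of a gene's best hit in another genome divided by the gene's score against itself, and it uses a fixed cutoff of 0.3 that is purely hypothetical. The function names and the "accessory" label for the intermediate case are likewise invented for this example.

```python
def score_ratio(hit_score, self_score):
    """Normalize a BLAST bit score by the query's self-hit score.

    Assumption: a ratio near 1 means the hit is almost as good as the gene
    aligned to itself; a ratio near 0 means no meaningful similarity.
    """
    if self_score <= 0:
        raise ValueError("self_score must be positive")
    return hit_score / self_score


def classify_gene(self_score, best_hit_scores, threshold=0.3):
    """Classify one gene from its best hits in the other genomes.

    best_hit_scores maps each other genome (hypothetical names) to the best
    bit score of the gene in that genome, with 0 meaning no hit at all.
    The threshold is an illustrative value, not one taken from the source.
    """
    if not best_hit_scores:
        raise ValueError("at least one comparison genome is required")
    present = [
        genome
        for genome, score in best_hit_scores.items()
        if score_ratio(score, self_score) >= threshold
    ]
    if len(present) == len(best_hit_scores):
        return "core"        # a putative ortholog in every other genome
    if not present:
        return "singleton"   # no putative ortholog anywhere else
    return "accessory"       # hypothetical label for everything in between
```

Normalizing by the self-hit score makes the cutoff independent of gene length, which a raw bit score threshold would not be. A production pipeline would typically also require reciprocal best hits before calling two genes orthologous; that step is omitted here.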

Journal ArticleDOI
09 Oct 2009-Science
TL;DR: New sequencing technologies have produced an explosion of "draft" genomes whose quality is hard to assess from the databases, and the ever-widening gap between drafted and finished genomes creates an urgent need for standards that distinguish good from poor data sets.
Abstract: For over a decade, genome sequences have adhered to only two standards that are relied on for purposes of sequence analysis by interested third parties (1, 2). However, ongoing developments in revolutionary sequencing technologies have resulted in a redefinition of traditional whole-genome sequencing that requires reevaluation of such standards. With commercially available 454 pyrosequencing (followed by Illumina, SOLiD, and now Helicos), there has been an explosion of genomes sequenced under the moniker “draft”; however, these can be very poor quality genomes (due to inherent errors in the sequencing technologies, and the inability of assembly programs to fully address these errors). Further, one can only infer that such draft genomes may be of poor quality by navigating through the databases to find the number and type of reads deposited in sequence trace repositories (and not all genomes have this available), or to identify the number of contigs or genome fragments deposited to the database. The difficulty in assessing the quality of such deposited genomes has created some havoc for genome analysis pipelines and has contributed to many wasted hours. Exponential leaps in raw sequencing capability and greatly reduced prices have further skewed the time- and cost-ratios of draft data generation versus the painstaking process of improving and finishing a genome. The result is an ever-widening gap between drafted and finished genomes that only promises to continue (see the figure, page 236); hence, there is an urgent need to distinguish good from poor data sets.

Journal ArticleDOI
TL;DR: Routine clinical use of massively parallel sequencing will require higher accuracy, better ways to select genomic subsets of interest, and improvements in the functionality, speed, and ease of use of data analysis software, which will increase the responsibility of geneticists to ensure that the information obtained is used in a medically and socially responsible manner.
Abstract: Massively parallel sequencing has reduced the cost and increased the throughput of genomic sequencing by more than three orders of magnitude, and it seems likely that costs will fall and throughput improve even more in the next few years. Clinical use of massively parallel sequencing will provide a way to identify the cause of many diseases of unknown etiology through simultaneous screening of thousands of loci for pathogenic mutations and by sequencing biological specimens for the genomic signatures of novel infectious agents. In addition to providing these entirely new diagnostic capabilities, massively parallel sequencing may also replace arrays and Sanger sequencing in clinical applications where they are currently being used. Routine clinical use of massively parallel sequencing will require higher accuracy, better ways to select genomic subsets of interest, and improvements in the functionality, speed, and ease of use of data analysis software. In addition, substantial enhancements in laboratory computer infrastructure, data storage, and data transfer capacity will be needed to handle the extremely large data sets produced. Clinicians and laboratory personnel will require training to use the sequence data effectively, and appropriate methods will need to be developed to deal with the incidental discovery of pathogenic mutations and variants of uncertain clinical significance. Massively parallel sequencing has the potential to transform the practice of medical genetics and related fields, but the vast amount of personal genomic data produced will increase the responsibility of geneticists to ensure that the information obtained is used in a medically and socially responsible manner.

Journal ArticleDOI
TL;DR: It is recommended that systems medicine should be developed through an international network of systems biology and medicine centers dedicated to inter-disciplinary training and education, to help reduce the gap in healthcare between developed and developing countries.
Abstract: High-throughput technologies for DNA sequencing and for analyses of transcriptomes, proteomes and metabolomes have provided the foundations for deciphering the structure, variation and function of the human genome and relating them to health and disease states. The increased efficiency of DNA sequencing opens up the possibility of analyzing a large number of individual genomes and transcriptomes, and complete reference proteomes and metabolomes are within reach using powerful analytical techniques based on chromatography, mass spectrometry and nuclear magnetic resonance. Computational and mathematical tools have enabled the development of systems approaches for deciphering the functional and regulatory networks underlying the behavior of complex biological systems. Further conceptual and methodological developments of these tools are needed for the integration of various data types across the multiple levels of organization and time frames that are characteristic of human development, physiology and disease. Medical genomics has attempted to overcome the initial limitations of genome-wide association studies and has identified a limited number of susceptibility loci for many complex and common diseases. Iterative systems approaches are starting to provide deeper insights into the mechanisms of human diseases, and to facilitate the development of better diagnostic and prognostic biomarkers for cancer and many other diseases. Systems approaches will transform the way drugs are developed through academy-industry partnerships that will target multiple components of networks and pathways perturbed in diseases. They will enable medicine to become predictive, personalized, preventive and participatory, and, in the process, concepts and methods from Western and oriental cultures can be combined. We recommend that systems medicine should be developed through an international network of systems biology and medicine centers dedicated to inter-disciplinary training and education, to help reduce the gap in healthcare between developed and developing countries.