Showing papers on "Genome" published in 2005


Journal Article
TL;DR: A comprehensive search for conserved elements in vertebrate genomes is conducted using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes) and a two-state phylogenetic hidden Markov model (phylo-HMM).
Abstract: We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.

3,719 citations


Journal Article
Takashi Matsumoto1, Jianzhong Wu1, Hiroyuki Kanamori1, Yuichi Katayose1 +262 more; Institutions (25)
11 Aug 2005-Nature
TL;DR: A map-based, finished quality sequence that covers 95% of the 389 Mb rice genome, including virtually all of the euchromatin and two complete centromeres, and finds evidence for widespread and recurrent gene transfer from the organelles to the nuclear chromosomes.
Abstract: Rice, one of the world's most important food plants, has important syntenic relationships with the other cereal species and is a model plant for the grasses. Here we present a map-based, finished quality sequence that covers 95% of the 389 Mb genome, including virtually all of the euchromatin and two complete centromeres. A total of 37,544 non-transposable-element-related protein-coding genes were identified, of which 71% had a putative homologue in Arabidopsis. In a reciprocal analysis, 90% of the Arabidopsis proteins had a putative homologue in the predicted rice proteome. Twenty-nine per cent of the 37,544 predicted genes appear in clustered gene families. The number and classes of transposable elements found in the rice genome are consistent with the expansion of syntenic regions in the maize and sorghum genomes. We find evidence for widespread and recurrent gene transfer from the organelles to the nuclear chromosomes. The map-based sequence has proven useful for the identification of genes underlying agronomic traits. The additional single-nucleotide polymorphisms and simple sequence repeats identified in our study should accelerate improvements in rice production.

3,423 citations


Journal Article
Piero Carninci, Takeya Kasukawa1, Shintaro Katayama, Julian Gough +194 more; Institutions (36)
02 Sep 2005-Science
TL;DR: Detailed polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
Abstract: This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.

3,412 citations


Journal Article
TL;DR: Hundreds of Arabidopsis genes were found that outperform traditional reference genes in terms of expression stability throughout development and under a range of environmental conditions, and the developed PCR primers or hybridization probes for the novel reference genes will enable better normalization and quantification of transcript levels in Arabidopsis in the future.
Abstract: Gene transcripts with invariant abundance during development and in the face of environmental stimuli are essential reference points for accurate gene expression analyses, such as RNA gel-blot analysis or quantitative reverse transcription-polymerase chain reaction (PCR). An exceptionally large set of data from Affymetrix ATH1 whole-genome GeneChip studies provided the means to identify a new generation of reference genes with very stable expression levels in the model plant species Arabidopsis (Arabidopsis thaliana). Hundreds of Arabidopsis genes were found that outperform traditional reference genes in terms of expression stability throughout development and under a range of environmental conditions. Most of these were expressed at much lower levels than traditional reference genes, making them very suitable for normalization of gene expression over a wide range of transcript levels. Specific and efficient primers were developed for 22 genes and tested on a diverse set of 20 cDNA samples. Quantitative reverse transcription-PCR confirmed superior expression stability and lower absolute expression levels for many of these genes, including genes encoding a protein phosphatase 2A subunit, a coatomer subunit, and an ubiquitin-conjugating enzyme. The developed PCR primers or hybridization probes for the novel reference genes will enable better normalization and quantification of transcript levels in Arabidopsis in the future.

2,694 citations


Journal Article
TL;DR: A KO-Based Annotation System (KOBAS) is developed that can automatically annotate a set of sequences with KO terms and identify both the most frequent and the statistically significantly enriched pathways.
Abstract: Motivation: High-throughput technologies such as DNA sequencing and microarrays have created the need for automated annotation of large sets of genes, including whole genomes, and automated identification of pathways. Ontologies, such as the popular Gene Ontology (GO), provide a common controlled vocabulary for these types of automated analysis. Yet, while GO offers tremendous value, it also has certain limitations such as the lack of direct association with pathways. Results: We demonstrated the use of the KEGG Orthology (KO), part of the KEGG suite of resources, as an alternative controlled vocabulary for automated annotation and pathway identification. We developed a KO-Based Annotation System (KOBAS) that can automatically annotate a set of sequences with KO terms and identify both the most frequent and the statistically significantly enriched pathways. Results from both whole genome and microarray gene cluster annotations with KOBAS are comparable and complementary to known annotations. KOBAS is a freely available standalone Python program that can contribute significantly to genome annotation and microarray analysis. Availability: Supplementary data and the KOBAS system are available at http://genome.cbi.pku.edu.cn/download.html Contact: weilp@mail.cbi.pku.edu.cn

2,595 citations
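
The pathway-enrichment step described in the KOBAS abstract can be illustrated with a small sketch. The abstract does not say which statistical test KOBAS applies, so the one-sided hypergeometric test below is an assumption chosen because it is a common choice for this kind of analysis; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, sample_size, sample_hits):
    """One-sided hypergeometric p-value: P(X >= sample_hits).

    total_genes   -- annotated genes in the background set
    pathway_genes -- background genes that belong to the pathway
    sample_size   -- genes in the query set (e.g. a microarray cluster)
    sample_hits   -- query genes that belong to the pathway
    """
    upper = min(pathway_genes, sample_size)
    favourable = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, sample_size - k)
        for k in range(sample_hits, upper + 1)
    )
    return favourable / comb(total_genes, sample_size)


# Toy example (made-up numbers): 10 background genes, 4 in the pathway,
# 3 query genes of which 2 fall in the pathway.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(f"enrichment p-value: {p_value:.4f}")
```

In practice one p-value is computed per pathway and the set of p-values is corrected for multiple testing before any pathway is reported as enriched.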


Journal Article
08 Dec 2005-Nature
TL;DR: A high-quality draft genome sequence of the domestic dog is reported, together with a dense map of single nucleotide polymorphisms (SNPs) across breeds, to shed light on the structure and evolution of genomes and genes.
Abstract: Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.

2,431 citations


Journal Article
01 Sep 2005-Nature
TL;DR: It is found that the patterns of evolution in human and chimpanzee protein-coding genes are highly correlated and dominated by the fixation of neutral and slightly deleterious alleles.
Abstract: Here we present a draft genome sequence of the common chimpanzee (Pan troglodytes). Through comparison with the human genome, we have generated a largely complete catalogue of the genetic differenc ...

2,267 citations


Journal Article
TL;DR: The genomic sequences of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans, were generated. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes.
Abstract: The development of efficient and inexpensive genome sequencing methods has revolutionized the study of human bacterial pathogens and improved vaccine design. Unfortunately, the sequence of a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or for antimicrobial targets. We have generated the genomic sequence of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans. Analysis of these genomes and those available in databases showed that the S. agalactiae species can be described by a pan-genome consisting of a core genome shared by all isolates, accounting for ≈80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes.

2,092 citations
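
The pan-genome decomposition in the S. agalactiae abstract (a core genome shared by all isolates plus a dispensable genome of partially shared and strain-specific genes) maps naturally onto set operations. The sketch below is only an illustration: it assumes genes have already been clustered into families with shared identifiers, and the strain names and gene labels are invented.

```python
def pan_genome_summary(strain_genes):
    """Split gene families into pan, core, dispensable and strain-specific sets.

    strain_genes -- dict mapping strain name -> set of gene-family identifiers
    """
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)            # every family seen in any strain
    core = set.intersection(*gene_sets)      # families present in all strains
    dispensable = pan - core                 # partially shared + strain-specific
    strain_specific = {
        name: genes - set().union(
            *(g for other, g in strain_genes.items() if other != name)
        )
        for name, genes in strain_genes.items()
    }
    return {
        "pan": pan,
        "core": core,
        "dispensable": dispensable,
        "strain_specific": strain_specific,
    }


# Hypothetical toy input: three strains, five gene families.
toy_strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g2", "g3", "g5"},
}
summary = pan_genome_summary(toy_strains)
print(sorted(summary["core"]), sorted(summary["dispensable"]))
```

Repeating the calculation while adding one strain at a time gives the counts of new genes per genome that an extrapolation like the one mentioned in the abstract would be fitted to.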


Journal Article
TL;DR: An interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results.
Abstract: Accessing and analyzing the exponentially expanding genomic sequence and functional data pose a challenge for biomedical researchers. Here we describe an interactive system, Galaxy, that combines the power of existing genome annotation databases with a simple Web portal to enable users to search remote resources, combine data from independent queries, and visualize the results. The heart of Galaxy is a flexible history system that stores the queries from each user; performs operations such as intersections, unions, and subtractions; and links to other computational tools. Galaxy can be accessed at http://g2.bx.psu.edu.

2,071 citations
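
The Galaxy abstract mentions operations such as intersections, unions, and subtractions on query results. As a rough illustration of what those operations mean for genomic intervals, here is a minimal sketch on half-open (start, end) coordinates for a single chromosome. It is not Galaxy's code or API; the function names and the coordinate convention are assumptions made for the example.

```python
def union(intervals):
    """Merge overlapping or touching half-open intervals into a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Regions covered by both interval sets."""
    a, b = union(a), union(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a, b):
    """Regions of a that are not covered by b."""
    out, b = [], union(b)
    for start, end in union(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor or b_start >= end:
                continue  # no overlap with the remaining part of this interval
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            out.append((cursor, end))
    return out


# Hypothetical query results on one chromosome.
genes = [(100, 200), (300, 400)]
conserved = [(150, 350)]
print(union(genes + conserved), intersect(genes, conserved), subtract(genes, conserved))
```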


Journal Article
17 Mar 2005-Nature
TL;DR: In this article, a comparative analysis of the human, mouse, rat and dog genomes is presented to create a systematic catalogue of common regulatory motifs in promoters and 3' untranslated regions (3' UTRs).
Abstract: Comprehensive identification of all functional elements encoded in the human genome is a fundamental need in biomedical research. Here, we present a comparative analysis of the human, mouse, rat and dog genomes to create a systematic catalogue of common regulatory motifs in promoters and 3' untranslated regions (3' UTRs). The promoter analysis yields 174 candidate motifs, including most previously known transcription-factor binding sites and 105 new motifs. The 3'-UTR analysis yields 106 motifs likely to be involved in post-transcriptional regulation. Nearly one-half are associated with microRNAs (miRNAs), leading to the discovery of many new miRNA genes and their likely target genes. Our results suggest that previous estimates of the number of human miRNA genes were low, and that miRNAs regulate at least 20% of human genes. The overall results provide a systematic view of gene regulation in the human, which will be refined as additional mammalian genomes become available.

1,954 citations


Journal Article
TL;DR: The average nucleotide identity (ANI) of the shared genes between two strains was found to be a robust means to compare genetic relatedness among strains, and ANI values of approximately 94% corresponded to the traditional 70% DNA-DNA reassociation standard of the current species definition.
Abstract: To help advance the species definition for prokaryotes, we have compared the gene content of 70 closely related and fully sequenced bacterial genomes to identify whether species boundaries exist, and to determine the role of the organism's ecology on its shared gene content. We found the average nucleotide identity (ANI) of the shared genes between two strains to be a robust means to compare genetic relatedness among strains, and that ANI values of ≈94% corresponded to the traditional 70% DNA–DNA reassociation standard of the current species definition. At the 94% ANI cutoff, current species includes only moderately homogeneous strains, e.g., most of the >4-Mb genomes share only 65–90% of their genes, apparently as a result of the strains having evolved in different ecological settings. Furthermore, diagnostic genetic signatures (boundaries) are evident between groups of strains of the same species, and the intergroup genetic similarity can be as high as 98–99% ANI, indicating that justifiable species might be found even among organisms that are nearly identical at the nucleotide level. Notably, a large fraction, e.g., up to 65%, of the differences in gene content within species is associated with bacteriophage and transposase elements, revealing an important role of these elements during bacterial speciation. Our findings are consistent with a definition for species that would include a more homogeneous set of strains than provided by the current definition and one that considers the ecology of the strains in addition to their evolutionary distance.
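
The ANI measure described in this abstract can be sketched as the mean percent identity over the genes two strains share. The example below assumes the shared genes have already been identified and pairwise aligned (gaps written as "-"), and it takes an unweighted mean per gene; both are simplifying assumptions, and the function names and sequences are made up for illustration.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += (x == y)
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def average_nucleotide_identity(aligned_gene_pairs):
    """Unweighted mean identity across all shared, aligned gene pairs."""
    identities = [percent_identity(a, b) for a, b in aligned_gene_pairs]
    return sum(identities) / len(identities)


# Two toy shared genes: one identical, one with a single mismatch in ten bases.
shared_genes = [
    ("ACGTACGTAC", "ACGTACGTAC"),
    ("ACGTACGTAC", "ACGTACGTAA"),
]
ani = average_nucleotide_identity(shared_genes)
above_cutoff = ani >= 94.0  # the approximately 94% value reported in the abstract
print(ani, above_cutoff)
```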

Journal Article
Matthew Berriman1, Elodie Ghedin2, Elodie Ghedin3, Christiane Hertz-Fowler1, Gaëlle Blandin3, Hubert Renauld1, Daniella Castanheira Bartholomeu3, Nicola Lennard1, Elisabet Caler3, N. Hamlin1, Brian J. Haas3, Ulrike Böhme1, Linda Hannick3, Martin Aslett1, Joshua Shallom3, Lucio Marcello4, Lihua Hou3, Bill Wickstead5, U. Cecilia M. Alsmark6, Claire Arrowsmith1, Rebecca Atkin1, Andrew Barron1, Frédéric Bringaud7, Karen Brooks1, Mark Carrington8, Inna Cherevach1, Tracey-Jane Chillingworth1, Carol Churcher1, Louise Clark1, Craig Corton1, Ann Cronin1, Robert L. Davies1, Jonathon Doggett1, Appolinaire Djikeng3, Tamara Feldblyum3, Mark C. Field8, Audrey Fraser1, Ian Goodhead1, Zahra Hance1, David Harper1, Barbara Harris1, Heidi Hauser1, Jessica B. Hostetler3, Al Ivens1, Kay Jagels1, David W. Johnson1, Justin Johnson3, Kristine Jones3, Arnaud Kerhornou1, Hean Koo3, Natasha Larke1, Scott M. Landfear9, Christopher Larkin3, Vanessa Leech8, Alexandra Line1, Angela Lord1, Annette MacLeod4, P. Mooney1, Sharon Moule1, David M. A. Martin10, Gareth W. Morgan11, Karen Mungall1, Halina Norbertczak1, Doug Ormond1, Grace Pai3, Christopher S. Peacock1, Jeremy Peterson3, Michael A. Quail1, Ester Rabbinowitsch1, Marie-Adèle Rajandream1, Chris P Reitter8, Steven L. Salzberg3, Mandy Sanders1, Seth Schobel3, Sarah Sharp1, Mark Simmonds1, Anjana J. Simpson3, Luke J. Tallon3, C. Michael R. Turner4, Andrew Tait4, Adrian Tivey1, Susan Van Aken3, Danielle Walker1, David Wanless3, Shiliang Wang3, Brian White1, Owen White3, Sally Whitehead1, John Woodward1, Jennifer R. Wortman3, Mark Raymond Adams12, T. Martin Embley6, Keith Gull5, Elisabetta Ullu13, J. David Barry4, Alan H. Fairlamb10, Fred R. Opperdoes14, Barclay G. Barrell1, John E. Donelson15, Neil Hall3, Neil Hall16, Claire M. Fraser3, Sara E. Melville8, Najib M. El-Sayed2, Najib M. El-Sayed3 
15 Jul 2005-Science
TL;DR: Comparisons of the cytoskeleton and endocytic trafficking systems of Trypanosoma brucei with those of humans and other eukaryotic organisms reveal major differences.
Abstract: African trypanosomes cause human sleeping sickness and livestock trypanosomiasis in sub-Saharan Africa. We present the sequence and analysis of the 11 megabase-sized chromosomes of Trypanosoma brucei. The 26-megabase genome contains 9068 predicted genes, including ∼900 pseudogenes and ∼1700 T. brucei–specific genes. Large subtelomeric arrays contain an archive of 806 variant surface glycoprotein (VSG) genes used by the parasite to evade the mammalian immune system. Most VSG genes are pseudogenes, which may be used to generate expressed mosaic genes by ectopic recombination. Comparisons of the cytoskeleton and endocytic trafficking systems with those of humans and other eukaryotic organisms reveal major differences. A comparison of metabolic pathways encoded by the genomes of T. brucei, T. cruzi, and Leishmania major reveals the least overall metabolic capability in T. brucei and the greatest in L. major. Horizontal transfer of genes of bacterial origin has contributed to some of the metabolic differences in these parasites, and a number of novel potential drug targets have been identified.

Journal Article
TL;DR: A new method for de novo identification of repeat families via extension of consensus seeds is developed, which enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.
Abstract: "Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats." (Webb Miller, personal communication) Motivation: De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis. Results: Our RepeatScout algorithm is more sensitive and is orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that ∼2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence. Availability: Source code is available for download at http://www-cse.ucsd.edu/groups/bioinformatics/software.html Contact: ppevzner@cs.ucsd.edu

Journal Article
21 Apr 2005-Nature
TL;DR: The draft sequence of the M. grisea genome is reported. Analysis of the gene set provides an insight into the adaptations required by a fungus to cause disease, and the genome has been subject to invasion and proliferation of active transposable elements, reflecting the clonal nature of this fungus imposed by widespread rice cultivation.
Abstract: Magnaporthe grisea is the most destructive pathogen of rice worldwide and the principal model organism for elucidating the molecular basis of fungal disease of plants. Here, we report the draft sequence of the M. grisea genome. Analysis of the gene set provides an insight into the adaptations required by a fungus to cause disease. The genome encodes a large and diverse set of secreted proteins, including those defined by unusual carbohydrate-binding domains. This fungus also possesses an expanded family of G-protein-coupled receptors, several new virulence-associated genes and large suites of enzymes involved in secondary metabolism. Consistent with a role in fungal pathogenesis, the expression of several of these genes is upregulated during the early stages of infection-related development. The M. grisea genome has been subject to invasion and proliferation of active transposable elements, reflecting the clonal nature of this fungus imposed by widespread rice cultivation.

Journal Article
TL;DR: The sequencing and mapping of the human genome provides a foundation for the elucidation of gene expression and protein function, and the identification of the biochemical pathways implicated in the natural history of chronic diseases.
Abstract: The sequencing and mapping of the human genome provides a foundation for the elucidation of gene expression and protein function, and the identification of the biochemical pathways implicated in the natural history of chronic diseases, including cancer, diabetes, and vascular and neurodegenerative

Journal Article
TL;DR: Questions are addressed, including which evolutionary pressures led to gene clustering, why closely related species produce different profiles of secondary metabolites, and whether fungal genomics will accelerate the discovery of new pharmacologically active natural products.
Abstract: Much of natural product chemistry concerns a group of compounds known as secondary metabolites. These low-molecular-weight metabolites often have potent physiological activities. Digitalis, morphine and quinine are plant secondary metabolites, whereas penicillin, cephalosporin, ergotrate and the statins are equally well known fungal secondary metabolites. Although chemically diverse, all secondary metabolites are produced by a few common biosynthetic pathways, often in conjunction with morphological development. Recent advances in molecular biology, bioinformatics and comparative genomics have revealed that the genes encoding specific fungal secondary metabolites are clustered and often located near telomeres. In this review, we address some important questions, including which evolutionary pressures led to gene clustering, why closely related species produce different profiles of secondary metabolites, and whether fungal genomics will accelerate the discovery of new pharmacologically active natural products.

Journal Article
TL;DR: The reconstruction of regulatory networks from expression profiles of human B cells is reported, suggestive of a hierarchical, scale-free network, where a few highly interconnected genes (hubs) account for most of the interactions.
Abstract: Cellular phenotypes are determined by the differential activity of networks linking coregulated genes. Available methods for the reverse engineering of such networks from genome-wide expression profiles have been successful only in the analysis of lower eukaryotes with simple genomes. Using a new method called ARACNe (algorithm for the reconstruction of accurate cellular networks), we report the reconstruction of regulatory networks from expression profiles of human B cells. The results are suggestive of a hierarchical, scale-free network, where a few highly interconnected genes (hubs) account for most of the interactions. Validation of the network against available data led to the identification of MYC as a major hub, which controls a network comprising known target genes as well as new ones, which were biochemically validated. The newly identified MYC targets include some major hubs. This approach can be generally useful for the analysis of normal and pathologic networks in mammalian cells.

Journal Article
TL;DR: The genomic map positions of paralogous genes duplicated before the fish-tetrapod split provide evidence of two distinct whole genome duplication events early in vertebrate evolution, highlighting the potential for these large-scale genomic events to have driven the evolutionary success of the vertebrate lineage.
Abstract: The hypothesis that the relatively large and complex vertebrate genome was created by two ancient, whole genome duplications has been hotly debated, but remains unresolved. We reconstructed the evolutionary relationships of all gene families from the complete gene sets of a tunicate, fish, mouse, and human, and then determined when each gene duplicated relative to the evolutionary tree of the organisms. We confirmed the results of earlier studies that there remains little signal of these events in numbers of duplicated genes, gene tree topology, or the number of genes per multigene family. However, when we plotted the genomic map positions of only the subset of paralogous genes that were duplicated prior to the fish–tetrapod split, their global physical organization provides unmistakable evidence of two distinct genome duplication events early in vertebrate evolution indicated by clear patterns of four-way paralogous regions covering a large part of the human genome. Our results highlight the potential for these large-scale genomic events to have driven the evolutionary success of the vertebrate lineage.

Journal Article
Alasdair Ivens1, Christopher S. Peacock1, Elizabeth A. Worthey2, Lee Murphy1, Gautam Aggarwal2, Matthew Berriman1, Ellen Sisk2, Marie-Adèle Rajandream1, Ellen Adlem1, Rita Aert3, Atashi Anupama2, Zina Apostolou, Philip Attipoe2, Nathalie Bason1, Christopher Bauser4, Alfred Beck5, Stephen M. Beverley6, Gabriella Bianchettin7, K. Borzym5, G. Bothe4, Carlo V. Bruschi7, Carlo V. Bruschi8, Matt Collins1, Eithon Cadag2, Laura Ciarloni7, Christine Clayton, Richard M.R. Coulson9, Ann Cronin1, Angela K. Cruz10, Robert L. Davies1, Javier G. De Gaudenzi11, Deborah E. Dobson6, Andreas Duesterhoeft, Gholam Fazelina2, Nigel Fosker1, Alberto C.C. Frasch11, Audrey Fraser1, Monika Fuchs, Claudia Gabel, Arlette Goble1, André Goffeau12, David Harris1, Christiane Hertz-Fowler1, Helmut Hilbert, David Horn13, Yiting Huang2, Sven Klages5, Andrew J Knights1, Michael Kube5, Natasha Larke1, Lyudmila Litvin2, Angela Lord1, Tin Louie2, Marco A. Marra, David Masuy12, Keith R. Matthews14, Shulamit Michaeli, Jeremy C. Mottram15, Silke Müller-Auer, Heather Munden2, Siri Nelson2, Halina Norbertczak1, Karen Oliver1, Susan O'Neil1, Martin Pentony2, Thomas M. Pohl4, Claire Price1, Bénédicte Purnelle12, Michael A. Quail1, Ester Rabbinowitsch1, Richard Reinhardt5, Michael A. Rieger, Joel Rinta2, Johan Robben3, Laura Robertson2, Jeronimo C. Ruiz10, Simon Rutter1, David L. Saunders1, Melanie Schäfer, Jacquie Schein, David C. Schwartz16, Kathy Seeger1, Amber Seyler2, Sarah Sharp1, Heesun Shin, Dhileep Sivam2, Rob Squares1, Steve Squares1, Valentina Tosato7, Christy Vogt2, Guido Volckaert3, Rolf Wambutt, T. Warren1, Holger Wedler, John Woodward1, Shiguo Zhou16, Wolfgang Zimmermann, Deborah F. Smith17, Jenefer M. Blackwell18, Kenneth Stuart2, Kenneth Stuart19, Bart Barrell1, Peter J. Myler19, Peter J. Myler2 
15 Jul 2005-Science
TL;DR: The organization of protein-coding genes into long, strand-specific, polycistronic clusters and lack of general transcription factors in the L. major, Trypanosoma brucei, and Trypanosoma cruzi (Tritryp) genomes suggest that the mechanisms regulating RNA polymerase II-directed transcription are distinct from those operating in other eukaryotes, although the trypanosomatids appear capable of chromatin remodeling.
Abstract: Leishmania species cause a spectrum of human diseases in tropical and subtropical regions of the world. We have sequenced the 36 chromosomes of the 32.8-megabase haploid genome of Leishmania major (Friedlin strain) and predict 911 RNA genes, 39 pseudogenes, and 8272 protein-coding genes, of which 36% can be ascribed a putative function. These include genes involved in host-pathogen interactions, such as proteolytic enzymes, and extensive machinery for synthesis of complex surface glycoconjugates. The organization of protein-coding genes into long, strand-specific, polycistronic clusters and lack of general transcription factors in the L. major, Trypanosoma brucei, and Trypanosoma cruzi (Tritryp) genomes suggest that the mechanisms regulating RNA polymerase II-directed transcription are distinct from those operating in other eukaryotes, although the trypanosomatids appear capable of chromatin remodeling. Abundant RNA-binding proteins are encoded in the Tritryp genomes, consistent with active posttranscriptional regulation of gene expression.

Journal Article
Najib M. El-Sayed1, Peter J. Myler2, Peter J. Myler3, Daniella Castanheira Bartholomeu4, Daniel Nilsson5, Gautam Aggarwal2, Anh-Nhi Tran5, Elodie Ghedin1, Elizabeth A. Worthey2, Arthur L. Delcher, Gaëlle Blandin4, Scott J. Westenberger6, Elisabet Caler4, Gustavo C. Cerqueira7, Carole Branche5, Brian J. Haas4, Atashi Anupama2, Erik Arner5, Lena Åslund8, Philip Attipoe2, Esteban J. Bontempi5, Frédéric Bringaud9, Peter Burton10, Eithon Cadag2, David A. Campbell6, Mark Carrington11, Jonathan Crabtree4, Hamid Darban5, José Franco da Silveira12, Pieter J. de Jong13, Kimberly Edwards5, Paul T. Englund14, Gholam Fazelina2, Tamara Feldblyum4, Marcela Ferella5, Alberto C.C. Frasch15, Keith Gull16, David Horn17, Lihua Hou4, Yiting Huang2, Ellen Kindlund5, Michele M. Klingbeil18, Sindy Kluge5, Hean Koo4, Daniela R. Lacerda19, Mariano J. Levin20, Hernan Lorenzi20, Tin Louie2, Carlos Renato Machado7, Richard McCulloch10, Alan McKenna5, Yumi Mizuno5, Jeremy C. Mottram10, Siri Nelson2, Stephen Ochaya5, Kazutoyo Osoegawa13, Grace Pai4, Marilyn Parsons2, Marilyn Parsons3, Martin Pentony2, Ulf Pettersson8, Mihai Pop4, José Luis Ramírez21, Joel Rinta2, Laura Robertson2, Steven L. Salzberg, Daniel O. Sánchez15, Amber Seyler2, Reuben Sunil Kumar Sharma11, Jyoti Shetty4, Anjana J. Simpson4, Ellen Sisk2, Martti T. Tammi5, Martti T. Tammi22, Rick L. Tarleton23, Santuza M. R. Teixeira7, Susan Van Aken4, Christy Vogt2, Pauline N. Ward10, Bill Wickstead16, Jennifer R. Wortman4, Owen White4, Claire M. Fraser4, Kenneth Stuart2, Kenneth Stuart3, Björn Andersson5 
15 Jul 2005-Science
TL;DR: Although the Tritryp lack several classes of signaling molecules, their kinomes contain a large and diverse set of protein kinases and phosphatases; their size and diversity imply previously unknown interactions and regulatory processes, which may be targets for intervention.
Abstract: Whole-genome sequencing of the protozoan pathogen Trypanosoma cruzi revealed that the diploid genome contains a predicted 22,570 proteins encoded by genes, of which 12,570 represent allelic pairs. Over 50% of the genome consists of repeated sequences, such as retrotransposons and genes for large families of surface molecules, which include trans-sialidases, mucins, gp63s, and a large novel family (>1300 copies) of mucin-associated surface protein (MASP) genes. Analyses of the T. cruzi, T. brucei, and Leishmania major (Tritryp) genomes imply differences from other eukaryotes in DNA repair and initiation of replication and reflect their unusual mitochondrial DNA. Although the Tritryp lack several classes of signaling molecules, their kinomes contain a large and diverse set of protein kinases and phosphatases; their size and diversity imply previously unknown interactions and regulatory processes, which may be targets for intervention.

Journal ArticleDOI
TL;DR: A number of elements in this region that have undergone intense purifying selection throughout mammalian evolution are described, and it is shown that these important elements are more numerous than previously thought.
Abstract: Comparisons of orthologous genomic DNA sequences can be used to characterize regions that have been subject to purifying selection and are enriched for functional elements. We here present the results of such an analysis on an alignment of sequences from 29 mammalian species. The alignment captures ∼3.9 neutral substitutions per site and spans ∼1.9 Mbp of the human genome. We identify constrained elements from 3 bp to over 1 kbp in length, covering ∼5.5% of the human locus. Our estimate for the total amount of nonexonic constraint experienced by this locus is roughly twice that for exonic constraint. Constrained elements tend to cluster, and we identify large constrained regions that correspond well with known functional elements. While constraint density inversely correlates with mobile element density, we also show the presence of unambiguously constrained elements overlapping mammalian ancestral repeats. In addition, we describe a number of elements in this region that have undergone intense purifying selection throughout mammalian evolution, and we show that these important elements are more numerous than previously thought. These results were obtained with Genomic Evolutionary Rate Profiling (GERP), a statistically rigorous and biologically transparent framework for constrained element identification. GERP identifies regions at high resolution that exhibit nucleotide substitution deficits, and measures these deficits as “rejected substitutions.” Rejected substitutions reflect the intensity of past purifying selection and are used to rank and characterize constrained elements. We anticipate that GERP and the types of analyses it facilitates will provide further insights and improved annotation for the human genome as mammalian genome sequence data become richer.
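
As a rough illustration of the "rejected substitutions" idea described in this abstract, the sketch below scores each alignment column as the expected number of substitutions under neutrality minus the number observed, then ranks runs of consecutive positive-scoring columns. This is a simplified toy, not the GERP implementation: the function names, the input lists, and the run-based grouping rule are assumptions made purely for illustration.

```python
from typing import List, Tuple


def rejected_substitutions(expected: List[float], observed: List[float]) -> List[float]:
    """Per-column substitution deficit: expected (neutral) minus observed."""
    if len(expected) != len(observed):
        raise ValueError("expected and observed must have the same length")
    return [e - o for e, o in zip(expected, observed)]


def constrained_runs(rs: List[float], min_len: int = 3) -> List[Tuple[int, int, float]]:
    """Group consecutive columns with a positive deficit into candidate elements.

    Returns (start, end, score) tuples with end exclusive, sorted by summed
    deficit in descending order. The grouping rule is a hypothetical stand-in
    for a real element-calling procedure.
    """
    runs = []
    start = None
    for i, value in enumerate(rs):
        if value > 0:
            if start is None:
                start = i  # open a new run
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i, sum(rs[start:i])))
            start = None  # close any open run
    # Close a run that extends to the end of the alignment.
    if start is not None and len(rs) - start >= min_len:
        runs.append((start, len(rs), sum(rs[start:])))
    return sorted(runs, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-column counts; 4.0 stands in for a neutral expectation.
    expected = [4.0] * 6
    observed = [1.0, 1.0, 1.0, 5.0, 2.0, 2.0]
    for start, end, score in constrained_runs(rejected_substitutions(expected, observed)):
        print(f"columns {start}-{end - 1}: rejected substitutions = {score}")
```

The default minimum length of 3 columns mirrors the shortest constrained elements mentioned in the abstract; any real analysis would estimate the expected counts from a neutral tree rather than supply them by hand.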

Journal ArticleDOI
22 Dec 2005-Nature
TL;DR: The aspergilli comprise a diverse group of filamentous fungi spanning over 200 million years of evolution, and a comparative study of Aspergillus nidulans with Aspergillus fumigatus and Aspergillus oryzae, used in the production of sake, miso and soy sauce, provides new insight into eukaryotic genome evolution and gene regulation.
Abstract: The aspergilli comprise a diverse group of filamentous fungi spanning over 200 million years of evolution. Here we report the genome sequence of the model organism Aspergillus nidulans, and a comparative study with Aspergillus fumigatus, a serious human pathogen, and Aspergillus oryzae, used in the production of sake, miso and soy sauce. Our analysis of genome structure provided a quantitative evaluation of forces driving long-term eukaryotic genome evolution. It also led to an experimentally validated model of mating-type locus evolution, suggesting the potential for sexual reproduction in A. fumigatus and A. oryzae. Our analysis of sequence conservation revealed over 5,000 non-coding regions actively conserved across all three species. Within these regions, we identified potential functional elements including a previously uncharacterized TPP riboswitch and motifs suggesting regulation in filamentous fungi by Puf family genes. We further obtained comparative and experimental evidence indicating widespread translational regulation by upstream open reading frames. These results enhance our understanding of these widely studied fungi as well as provide new insight into eukaryotic genome evolution and gene regulation.

Journal ArticleDOI
Ludwig Eichinger, Justin A. Pachebat, Gernot Glöckner, Marie-Adèle Rajandream, Richard Sucgang, Matthew Berriman, J. Song, Rolf Olsen, Karol Szafranski, Qikai Xu, Budi Tunggal, Sarah K. Kummerfeld, Martin Madera, Bernard Anri Konfortov, Francisco Rivero, Alan T. Bankier, Rüdiger Lehmann, N. Hamlin, Robert L. Davies, Pascale Gaudet, Petra Fey, Karen E Pilcher, Guokai Chen, David L. Saunders, Erica Sodergren, P. Davis, Arnaud Kerhornou, X. Nie, Neil Hall, Christophe Anjard, Lisa Hemphill, Nathalie Bason, Patrick Farbrother, Brian A. Desany, Eric M. Just, Takahiro Morio, René Rost, Carol Churcher, J. Cooper, Stephen F. Haydock, N. van Driessche, Ann Cronin, Ian Goodhead, Donna M. Muzny, T. Mourier, Arnab Pain, Mingyang Lu, D. Harper, R. Lindsay, Heidi Hauser, Kylie R. James, M. Quiles, M. Madan Babu, Tsuneyuki Saito, Carmen Buchrieser, A. Wardroper, Marius Felder, M. Thangavelu, D. Johnson, Andrew J Knights, H. Loulseged, Karen Mungall, Karen Oliver, Claire Price, Michael A. Quail, Hideko Urushihara, Judith Hernandez, Ester Rabbinowitsch, David Steffen, Mandy Sanders, Jun Ma, Yuji Kohara, Sarah Sharp, Mark Simmonds, S. Spiegler, Adrian Tivey, Sumio Sugano, Brian White, Danielle Walker, John Woodward, Thomas Winckler, Yoshiaki Tanaka, Gad Shaulsky, Michael Schleicher, George M. Weinstock, André Rosenthal, Edward C. Cox, Rex L. Chisholm, Richard A. Gibbs, William F. Loomis, Matthias Platzer, Robert R. Kay, Jeffrey G. Williams, Paul H. Dear, Angelika A. Noegel, Bart Barrell, Adam Kuspa
05 May 2005-Nature
TL;DR: A proteome-based phylogeny shows that the amoebozoa diverged from the animal–fungal lineage after the plant–animal split, but Dictyostelium seems to have retained more of the diversity of the ancestral genome than have plants, animals or fungi.
Abstract: The social amoebae are exceptional in their ability to alternate between unicellular and multicellular forms. Here we describe the genome of the best-studied member of this group, Dictyostelium discoideum. The gene-dense chromosomes of this organism encode approximately 12,500 predicted proteins, a high proportion of which have long, repetitive amino acid tracts. There are many genes for polyketide synthases and ABC transporters, suggesting an extensive secondary metabolism for producing and exporting small molecules. The genome is rich in complex repeats, one class of which is clustered and may serve as centromeres. Partial copies of the extrachromosomal ribosomal DNA (rDNA) element are found at the ends of each chromosome, suggesting a novel telomere structure and the use of a common mechanism to maintain both the rDNA and chromosomal termini. A proteome-based phylogeny shows that the amoebozoa diverged from the animal-fungal lineage after the plant-animal split, but Dictyostelium seems to have retained more of the diversity of the ancestral genome than have plants, animals or fungi.

Journal ArticleDOI
TL;DR: Evidence now indicates that genes retained in duplicate typically diversify in function or undergo subfunctionalization, and that some duplicate genes are more prone to retention than others.

Journal ArticleDOI
TL;DR: To identify other miRNA genes in pathogenic viruses, a new miRNA gene prediction method was combined with small-RNA cloning from several virus-infected cell types; miRNAs were predicted in several large DNA viruses.
Abstract: Epstein-Barr virus (EBV or HHV4), a member of the human herpesvirus (HHV) family, has recently been shown to encode microRNAs (miRNAs). In contrast to most eukaryotic miRNAs, these viral miRNAs do not have close homologs in other viral genomes or in the genome of the human host. To identify other miRNA genes in pathogenic viruses, we combined a new miRNA gene prediction method with small-RNA cloning from several virus-infected cell types. We cloned ten miRNAs in the Kaposi sarcoma-associated virus (KSHV or HHV8), nine miRNAs in the mouse gammaherpesvirus 68 (MHV68) and nine miRNAs in the human cytomegalovirus (HCMV or HHV5). These miRNA genes are expressed individually or in clusters from either polymerase (pol) II or pol III promoters, and share no substantial sequence homology with one another or with the known human miRNAs. Generally, we predicted miRNAs in several large DNA viruses, and we could neither predict nor experimentally identify miRNAs in the genomes of small RNA viruses or retroviruses.

Journal ArticleDOI
TL;DR: The main factors — including models of the allelic architecture of common diseases, sample size, map density and sample-collection biases — that need to be taken into account in order to optimize the cost efficiency of identifying genuine disease-susceptibility loci are outlined.
Abstract: To fully understand the allelic variation that underlies common diseases, complete genome sequencing for many individuals with and without disease is required. This is still not technically feasible. However, recently it has become possible to carry out partial surveys of the genome by genotyping large numbers of common SNPs in genome-wide association studies. Here, we outline the main factors - including models of the allelic architecture of common diseases, sample size, map density and sample-collection biases - that need to be taken into account in order to optimize the cost efficiency of identifying genuine disease-susceptibility loci.

Journal ArticleDOI
20 May 2005-Science
TL;DR: The transcribed portions of the human genome are predominantly composed of interlaced networks of both poly A+ and poly A– annotated transcripts and unannotated transcripts of unknown function, which has important implications for interpreting genotype-phenotype associations, regulation of gene expression, and the definition of a gene.
Abstract: Sites of transcription of polyadenylated and nonpolyadenylated RNAs for 10 human chromosomes were mapped at 5-base pair resolution in eight cell lines. Unannotated, nonpolyadenylated transcripts comprise the major proportion of the transcriptional output of the human genome. Of all transcribed sequences, 19.4, 43.7, and 36.9% were observed to be polyadenylated, nonpolyadenylated, and bimorphic, respectively. Half of all transcribed sequences are found only in the nucleus and for the most part are unannotated. Overall, the transcribed portions of the human genome are predominantly composed of interlaced networks of both poly A+ and poly A- annotated transcripts and unannotated transcripts of unknown function. This organization has important implications for interpreting genotype-phenotype associations, regulation of gene expression, and the definition of a gene.

Journal ArticleDOI
TL;DR: Until around 1990, most multigene families were thought to be subject to concerted evolution, in which all member genes of a family evolve as a unit in concert, but phylogenetic analysis of MHC and other immune system genes showed a quite different evolutionary pattern, and a new model called birth-and-death evolution was proposed.
Abstract: Until around 1990, most multigene families were thought to be subject to concerted evolution, in which all member genes of a family evolve as a unit in concert. However, phylogenetic analysis of MHC and other immune system genes showed a quite different evolutionary pattern, and a new model called birth-and-death evolution was proposed. In this model, new genes are created by gene duplication and some duplicate genes stay in the genome for a long time, whereas others are inactivated or deleted from the genome. Later investigations have shown that most non-rRNA genes including highly conserved histone or ubiquitin genes are subject to this type of evolution. However, the controversy over the two models is still continuing because the distinction between the two models becomes difficult when sequence differences are small. Unlike concerted evolution, the model of birth-and-death evolution can give some insights into the origins of new genetic systems or new phenotypic characters.

Journal ArticleDOI
22 Dec 2005-Nature
TL;DR: Specific expansion of genes for secretory hydrolytic enzymes, amino acid metabolism and amino acid/sugar uptake transporters supports the idea that A. oryzae is an ideal microorganism for fermentation.
Abstract: The genome of Aspergillus oryzae, a fungus important for the production of traditional fermented foods and beverages in Japan, has been sequenced. The ability to secrete large amounts of proteins and the development of a transformation system have facilitated the use of A. oryzae in modern biotechnology. Although both A. oryzae and Aspergillus flavus belong to the section Flavi of the subgenus Circumdati of Aspergillus, A. oryzae, unlike A. flavus, does not produce aflatoxin, and its long history of use in the food industry has proved its safety. Here we show that the 37-megabase (Mb) genome of A. oryzae contains 12,074 genes and is expanded by 7-9 Mb in comparison with the genomes of Aspergillus nidulans and Aspergillus fumigatus. Comparison of the three aspergilli species revealed the presence of syntenic blocks and A. oryzae-specific blocks (lacking synteny with A. nidulans and A. fumigatus) in a mosaic manner throughout the genome of A. oryzae. The blocks of A. oryzae-specific sequence are enriched for genes involved in metabolism, particularly those for the synthesis of secondary metabolites. Specific expansion of genes for secretory hydrolytic enzymes, amino acid metabolism and amino acid/sugar uptake transporters supports the idea that A. oryzae is an ideal microorganism for fermentation.

Journal ArticleDOI
14 Oct 2005-Science
TL;DR: A high-resolution genetic map of the human genome is presented, based on statistical analyses of genetic variation data, and more than 25,000 recombination hotspots are identified, together with motifs and sequence contexts that play a role in hotspot activity.
Abstract: Genetic maps, which document the way in which recombination rates vary over a genome, are an essential tool for many genetic analyses. We present a high-resolution genetic map of the human genome, based on statistical analyses of genetic variation data, and identify more than 25,000 recombination hotspots, together with motifs and sequence contexts that play a role in hotspot activity. Differences between the behavior of recombination rates over large (megabase) and small (kilobase) scales lead us to suggest a two-stage model for recombination in which hotspots are stochastic features, within a framework in which large-scale rates are constrained.