
Showing papers on "Genomics" published in 2003


Journal ArticleDOI
John W. Belmont1, Paul Hardenbol, Thomas D. Willis, Fuli Yu1, Huanming Yang2, Lan Yang Ch'Ang, Wei Huang3, Bin Liu2, Yan Shen3, Paul K.H. Tam4, Lap-Chee Tsui4, Mary M.Y. Waye5, Jeffrey Tze Fei Wong6, Changqing Zeng2, Qingrun Zhang2, Mark S. Chee7, Luana Galver7, Semyon Kruglyak7, Sarah S. Murray7, Arnold Oliphant7, Alexandre Montpetit8, Fanny Chagnon8, Vincent Ferretti8, Martin Leboeuf8, Michael S. Phillips8, Andrei Verner8, Shenghui Duan9, Denise L. Lind10, Raymond D. Miller9, John P. Rice9, Nancy L. Saccone9, Patricia Taillon-Miller9, Ming Xiao10, Akihiro Sekine, Koki Sorimachi, Yoichi Tanaka, Tatsuhiko Tsunoda, Eiji Yoshino, David R. Bentley11, Sarah E. Hunt11, Don Powell11, Houcan Zhang12, Ichiro Matsuda13, Yoshimitsu Fukushima14, Darryl Macer15, Eiko Suda15, Charles N. Rotimi16, Clement Adebamowo17, Toyin Aniagwu17, Patricia A. Marshall18, Olayemi Matthew17, Chibuzor Nkwodimmah17, Charmaine D.M. Royal16, Mark Leppert19, Missy Dixon19, Fiona Cunningham20, Ardavan Kanani20, Gudmundur A. Thorisson20, Peter E. Chen21, David J. Cutler21, Carl S. Kashuk21, Peter Donnelly22, Jonathan Marchini22, Gilean McVean22, Simon Myers22, Lon R. Cardon22, Andrew P. Morris22, Bruce S. Weir23, James C. Mullikin24, Michael Feolo24, Mark J. Daly25, Renzong Qiu26, Alastair Kent, Georgia M. Dunston16, Kazuto Kato27, Norio Niikawa28, Jessica Watkin29, Richard A. Gibbs1, Erica Sodergren1, George M. Weinstock1, Richard K. Wilson9, Lucinda Fulton9, Jane Rogers11, Bruce W. Birren25, Hua Han2, Hongguang Wang, Martin Godbout30, John C. Wallenburg8, Paul L'Archevêque, Guy Bellemare, Kazuo Todani, Takashi Fujita, Satoshi Tanaka, Arthur L. Holden, Francis S. Collins24, Lisa D. Brooks24, Jean E. McEwen24, Mark S. Guyer24, Elke Jordan31, Jane Peterson24, Jack Spiegel24, Lawrence M. Sung32, Lynn F. Zacharia24, Karen Kennedy29, Michael Dunn29, Richard Seabrook29, Mark Shillito, Barbara Skene29, John Stewart29, David Valle21, Ellen Wright Clayton33, Lynn B. 
Jorde19, Aravinda Chakravarti21, Mildred K. Cho34, Troy Duster35, Troy Duster36, Morris W. Foster37, Maria Jasperse38, Bartha Maria Knoppers39, Pui-Yan Kwok10, Julio Licinio40, Jeffrey C. Long41, Pilar N. Ossorio42, Vivian Ota Wang33, Charles N. Rotimi16, Patricia Spallone29, Patricia Spallone43, Sharon F. Terry44, Eric S. Lander25, Eric H. Lai45, Deborah A. Nickerson46, Gonçalo R. Abecasis41, David Altshuler47, Michael Boehnke41, Panos Deloukas11, Julie A. Douglas41, Stacey Gabriel25, Richard R. Hudson48, Thomas J. Hudson8, Leonid Kruglyak49, Yusuke Nakamura50, Robert L. Nussbaum24, Stephen F. Schaffner25, Stephen T. Sherry24, Lincoln Stein20, Toshihiro Tanaka 
18 Dec 2003-Nature
TL;DR: The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance the ability to choose targets for therapeutic intervention.
Abstract: The goal of the International HapMap Project is to determine the common patterns of DNA sequence variation in the human genome and to make this information freely available in the public domain. An international consortium is developing a map of these patterns across the genome by determining the genotypes of one million or more sequence variants, their frequencies and the degree of association between them, in DNA samples from populations with ancestry from parts of Africa, Asia and Europe. The HapMap will allow the discovery of sequence variants that affect common disease, will facilitate development of diagnostic tools, and will enhance our ability to choose targets for therapeutic intervention.
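The abstract does not say how "the degree of association" between variants is measured. One widely used measure is the linkage disequilibrium coefficient D and its normalized form r². The sketch below is illustrative only: the function name and the four haplotype counts are hypothetical, and it is not taken from the HapMap consortium's methods.

```python
def ld_r_squared(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) for two biallelic sites from haplotype counts.

    n_AB, n_Ab, n_aB, n_ab are counts of the four possible haplotypes
    (hypothetical inputs; alleles A/a at site 1 and B/b at site 2).
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / total    # frequency of allele B at site 2
    d = p_AB - p_A * p_B           # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d, d * d / denom


if __name__ == "__main__":
    # Two sites that always travel together versus two independent sites.
    print(ld_r_squared(50, 0, 0, 50))
    print(ld_r_squared(25, 25, 25, 25))
```

An r² near 1 means that genotyping one variant largely predicts the other, which is the redundancy a haplotype map is meant to expose.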

5,926 citations


Journal ArticleDOI
01 Aug 2003-Science
TL;DR: Genome-wide analysis of the distribution of integration events revealed the existence of a large integration site bias at both the chromosome and gene levels, and insertion mutations were identified in genes that are regulated in response to the plant hormone ethylene.
Abstract: Over 225,000 independent Agrobacterium transferred DNA (T-DNA) insertion events in the genome of the reference plant Arabidopsis thaliana have been created that represent near saturation of the gene space. The precise locations were determined for more than 88,000 T-DNA insertions, which resulted in the identification of mutations in more than 21,700 of the approximately 29,454 predicted Arabidopsis genes. Genome-wide analysis of the distribution of integration events revealed the existence of a large integration site bias at both the chromosome and gene levels. Insertion mutations were identified in genes that are regulated in response to the plant hormone ethylene.

5,227 citations


Journal ArticleDOI
16 Jan 2003-Nature
TL;DR: It is found that genes of similar functions are clustered in distinct, multi-megabase regions of individual chromosomes; genes in these regions tend to share transcriptional profiles.
Abstract: A principal challenge currently facing biologists is how to connect the complete DNA sequence of an organism to its development and behaviour. Large-scale targeted-deletions have been successful in defining gene functions in the single-celled yeast Saccharomyces cerevisiae, but comparable analyses have yet to be performed in an animal. Here we describe the use of RNA interference to inhibit the function of ∼86% of the 19,427 predicted genes of C. elegans. We identified mutant phenotypes for 1,722 genes, about two-thirds of which were not previously associated with a phenotype. We find that genes of similar functions are clustered in distinct, multi-megabase regions of individual chromosomes; genes in these regions tend to share transcriptional profiles. Our resulting data set and reusable RNAi library of 16,757 bacterial clones will facilitate systematic analyses of the connections among gene sequence, chromosomal location and gene function in C. elegans.

3,529 citations


Journal ArticleDOI
15 May 2003-Nature
TL;DR: A comparative analysis of the yeast Saccharomyces cerevisiae, based on high-quality draft sequences of three related species, identified genome-wide regulatory motifs, inferred a putative function for most of them, and provided insights into their combinatorial interactions.
Abstract: Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human.

1,837 citations


Journal ArticleDOI
24 Apr 2003-Nature
TL;DR: With the completion of a high-quality, comprehensive sequence of the human genome, in the fiftieth anniversary year of the discovery of the double-helical structure of DNA, all of the initial objectives of the Human Genome Project (HGP) have been achieved, and the article contemplates a vision for the future of genomics research.
Abstract: The completion of a high-quality, comprehensive sequence of the human genome, in this fiftieth anniversary year of the discovery of the double-helical structure of DNA, is a landmark event. The genomic era is now a reality. In contemplating a vision for the future of genomics research, it is appropriate to consider the remarkable path that has brought us here. The rollfold (Figure 1) shows a timeline of landmark accomplishments in genetics and genomics, beginning with Gregor Mendel’s discovery of the laws of heredity and their rediscovery in the early days of the twentieth century. Recognition of DNA as the hereditary material, determination of its structure, elucidation of the genetic code, development of recombinant DNA technologies, and establishment of increasingly automatable methods for DNA sequencing set the stage for the Human Genome Project (HGP) to begin in 1990 (see also www.nature.com/nature/DNA50). Thanks to the vision of the original planners, and the creativity and determination of a legion of talented scientists who decided to make this project their overarching focus, all of the initial objectives of the HGP have now been achieved at least two years ahead of expectation, and a revolution in biological research has begun. The project’s new research strategies and experimental technologies have generated a steady stream of ever-larger and more complex genomic data sets that have poured into public databases and have transformed the study of virtually all life processes. The genomic approach of technology development and large-scale generation of community resource data sets has introduced an important new dimension into biological and biomedical research. Interwoven advances in genetics, comparative genomics, high-throughput biochemistry and bioinformatics

1,704 citations


Journal ArticleDOI
TL;DR: Analysis of the complete asexual intraerythrocytic developmental cycle (IDC) transcriptome of the HB3 strain of P. falciparum demonstrates that this parasite has evolved an extremely specialized mode of transcriptional regulation that produces a continuous cascade of gene expression, beginning with genes corresponding to general cellular processes, such as protein synthesis, and ending with Plasmodium-specific functionalities.
Abstract: Plasmodium falciparum is the causative agent of the most burdensome form of human malaria, affecting 200–300 million individuals per year worldwide. The recently sequenced genome of P. falciparum revealed over 5,400 genes, of which 60% encode proteins of unknown function. Insights into the biochemical function and regulation of these genes will provide the foundation for future drug and vaccine development efforts toward eradication of this disease. By analyzing the complete asexual intraerythrocytic developmental cycle (IDC) transcriptome of the HB3 strain of P. falciparum, we demonstrate that at least 60% of the genome is transcriptionally active during this stage. Our data demonstrate that this parasite has evolved an extremely specialized mode of transcriptional regulation that produces a continuous cascade of gene expression, beginning with genes corresponding to general cellular processes, such as protein synthesis, and ending with Plasmodium-specific functionalities, such as genes involved in erythrocyte invasion. The data reveal that genes contiguous along the chromosomes are rarely coregulated, while transcription from the plastid genome is highly coregulated and likely polycistronic. Comparative genomic hybridization between HB3 and the reference genome strain (3D7) was used to distinguish between genes not expressed during the IDC and genes not detected because of possible sequence variations. Genomic differences between these strains were found almost exclusively in the highly antigenic subtelomeric regions of chromosomes. The simple cascade of gene regulation that directs the asexual development of P. falciparum is unprecedented in eukaryotic biology. The transcriptome of the IDC resembles a “just-in-time” manufacturing process whereby induction of any given gene occurs once per cycle and only at a time when it is required. 
These data provide to our knowledge the first comprehensive view of the timing of transcription throughout the intraerythrocytic development of P. falciparum and provide a resource for the identification of new chemotherapeutic and vaccine candidates.

1,598 citations


Journal ArticleDOI
20 Mar 2003-Nature
TL;DR: In this paper, the authors describe comprehensive genetic screens of mouse, plant and human transcriptomes by considering gene expression values as quantitative traits and identify a gene expression pattern strongly associated with obesity in a murine cross and observe two distinct obesity subtypes.
Abstract: Treating messenger RNA transcript abundances as quantitative traits and mapping gene expression quantitative trait loci for these traits has been pursued in gene-specific ways. Transcript abundances often serve as a surrogate for classical quantitative traits in that the levels of expression are significantly correlated with the classical traits across members of a segregating population. The correlation structure between transcript abundances and classical traits has been used to identify susceptibility loci for complex diseases such as diabetes and allergic asthma. One study recently completed the first comprehensive dissection of transcriptional regulation in budding yeast, giving a detailed glimpse of a genome-wide survey of the genetics of gene expression. Unlike classical quantitative traits, which often represent gross clinical measurements that may be far removed from the biological processes giving rise to them, the genetic linkages associated with transcript abundance afford a closer look at cellular biochemical processes. Here we describe comprehensive genetic screens of mouse, plant and human transcriptomes by considering gene expression values as quantitative traits. We identify a gene expression pattern strongly associated with obesity in a murine cross, and observe two distinct obesity subtypes. Furthermore, we find that these obesity subtypes are under the control of different loci.

1,539 citations


Journal ArticleDOI
23 Oct 2003-Nature
TL;DR: The results suggest that data sets consisting of single or a small number of concatenated genes have a significant probability of supporting conflicting topologies, and have important implications for resolving branches of the tree of life.
Abstract: One of the most pervasive challenges in molecular phylogenetics is the incongruence between phylogenies obtained using different data sets, such as individual genes. To systematically investigate the degree of incongruence, and potential methods for resolving it, we screened the genome sequences of eight yeast species and selected 106 widely distributed orthologous genes for phylogenetic analyses, singly and by concatenation. Our results suggest that data sets consisting of single or a small number of concatenated genes have a significant probability of supporting conflicting topologies. By contrast, analyses of the entire data set of concatenated genes yielded a single, fully resolved species tree with maximum support. Comparable results were obtained with a concatenation of a minimum of 20 genes; substantially more genes than commonly used but a small fraction of any genome. These results have important implications for resolving branches of the tree of life.

1,490 citations


Journal ArticleDOI
TL;DR: The most useful contribution of the genomics model to population genetics will be improving inferences about population demography and evolutionary history.
Abstract: Population genomics has the potential to improve studies of evolutionary genetics, molecular ecology and conservation biology, by facilitating the identification of adaptive molecular variation and by improving the estimation of important parameters such as population size, migration rates and phylogenetic relationships. There has been much excitement in the recent literature about the identification of adaptive molecular variation using the population-genomic approach. However, the most useful contribution of the genomics model to population genetics will be improving inferences about population demography and evolutionary history.

1,276 citations


Journal ArticleDOI
TL;DR: Comparisons of the two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers, which will help to understand the evolutionary forces that mold nematode genomes.
Abstract: The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.

954 citations


Journal ArticleDOI
31 Oct 2003-Science
TL;DR: In this paper, a dual experimental strategy was used to verify and correct the initial genome sequence annotation of the reference plant Arabidopsis and identified 5817 novel transcription units including a substantial amount of antisense gene transcription, and 40 genes within the genetically defined centromeres.
Abstract: Functional analysis of a genome requires accurate gene structure information and a complete gene inventory. A dual experimental strategy was used to verify and correct the initial genome sequence annotation of the reference plant Arabidopsis. Sequencing full-length cDNAs and hybridizations using RNA populations from various tissues to a set of high-density oligonucleotide arrays spanning the entire genome allowed the accurate annotation of thousands of gene structures. We identified 5817 novel transcription units, including a substantial amount of antisense gene transcription, and 40 genes within the genetically defined centromeres. This approach resulted in completion of approximately 30% of the Arabidopsis ORFeome as a resource for global functional experimentation of the plant proteome.

Journal ArticleDOI
TL;DR: T4 functional genomics will aid in the interpretation of these newly sequenced T4-related genomes and in broadening the understanding of the complex evolution and ecology of phages—the most abundant and among the most ancient biological entities on Earth.
Abstract: Phage T4 has provided countless contributions to the paradigms of genetics and biochemistry. Its complete genome sequence of 168,903 bp encodes about 300 gene products. T4 biology and its genomic sequence provide the best-understood model for modern functional genomics and proteomics. Variations on gene expression, including overlapping genes, internal translation initiation, spliced genes, translational bypassing, and RNA processing, alert us to the caveats of purely computational methods. The T4 transcriptional pattern reflects its dependence on the host RNA polymerase and the use of phage-encoded proteins that sequentially modify RNA polymerase; transcriptional activator proteins, a phage sigma factor, anti-sigma, and sigma decoy proteins also act to specify early, middle, and late promoter recognition. Posttranscriptional controls by T4 provide excellent systems for the study of RNA-dependent processes, particularly at the structural level. The redundancy of DNA replication and recombination systems of T4 reveals how phage and other genomes are stably replicated and repaired in different environments, providing insight into genome evolution and adaptations to new hosts and growth environments. Moreover, genomic sequence analysis has provided new insights into tail fiber variation, lysis, gene duplications, and membrane localization of proteins, while high-resolution structural determination of the "cell-puncturing device," combined with the three-dimensional image reconstruction of the baseplate, has revealed the mechanism of penetration during infection. Despite these advances, nearly 130 potential T4 genes remain uncharacterized. Current phage-sequencing initiatives are now revealing the similarities and differences among members of the T4 family, including those that infect bacteria other than Escherichia coli. 
T4 functional genomics will aid in the interpretation of these newly sequenced T4-related genomes and in broadening our understanding of the complex evolution and ecology of phages, the most abundant and among the most ancient biological entities on Earth.

Journal ArticleDOI
TL;DR: In this review, retrovirus-mediated strategies used for investigation of gene functions and function-based screening strategies are described.

Journal ArticleDOI
14 Aug 2003-Nature
TL;DR: The generation and analysis of over 12 megabases of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region.
Abstract: The systematic comparison of genomic sequences from different organisms represents a central focus of contemporary genome analysis. Comparative analyses of vertebrate sequences can identify coding and conserved non-coding regions, including regulatory elements, and provide insight into the forces that have rendered modern-day genomes. As a complement to whole-genome sequencing efforts, we are sequencing and comparing targeted genomic regions in multiple, evolutionarily diverse vertebrates. Here we report the generation and analysis of over 12 megabases (Mb) of sequence from 12 species, all derived from the genomic region orthologous to a segment of about 1.8 Mb on human chromosome 7 containing ten genes, including the gene mutated in cystic fibrosis. These sequences show conservation reflecting both functional constraints and the neutral mutational events that shaped this genomic region. In particular, we identify substantial numbers of conserved non-coding segments beyond those previously identified experimentally, most of which are not detectable by pair-wise sequence comparisons alone. Analysis of transposable element insertions highlights the variation in genome dynamics among these species and confirms the placement of rodents as a sister group to the primates.

Journal ArticleDOI
TL;DR: The antiquity of the pre-vertebrate lineages that ultimately gave rise to the Adenoviridae is illustrated by morphological similarities between adenoviruses and bacteriophages, and by use of a protein-primed DNA replication strategy by adenoviruses, certain bacteria and bacteriophages, and linear plasmids of fungi and plants.
Abstract: This review provides an update of the genetic content, phylogeny and evolution of the family Adenoviridae. An appraisal of the condition of adenovirus genomics highlights the need to ensure that public sequence information is interpreted accurately. To this end, all complete genome sequences available have been reannotated. Adenoviruses fall into four recognized genera, plus possibly a fifth, which have apparently evolved with their vertebrate hosts, but have also engaged in a number of interspecies transmission events. Genes inherited by all modern adenoviruses from their common ancestor are located centrally in the genome and are involved in replication and packaging of viral DNA and formation and structure of the virion. Additional niche-specific genes have accumulated in each lineage, mostly near the genome termini. Capture and duplication of genes in the setting of a ‘leader–exon structure’, which results from widespread use of splicing, appear to have been central to adenovirus evolution. The antiquity of the pre-vertebrate lineages that ultimately gave rise to the Adenoviridae is illustrated by morphological similarities between adenoviruses and bacteriophages, and by use of a protein-primed DNA replication strategy by adenoviruses, certain bacteria and bacteriophages, and linear plasmids of fungi and plants.

Journal ArticleDOI
TL;DR: The present estimate, based on the principle of evolutionary parsimony, suggests a simple last universal common ancestor with only 500–600 genes.
Abstract: Comparative genomics, using computational and experimental methods, enables the identification of a minimal set of genes that is necessary and sufficient for sustaining a functional cell. For most essential cellular functions, two or more unrelated or distantly related proteins have evolved; only about 60 proteins, primarily those involved in translation, are common to all cellular life. The reconstruction of ancestral life-forms is based on the principle of evolutionary parsimony, but the size and composition of the reconstructed ancestral gene-repertoires depend on relative rates of gene loss and horizontal gene-transfer. The present estimate suggests a simple last universal common ancestor with only 500-600 genes.

Journal ArticleDOI
TL;DR: A deeper understanding of this disease will require new statistical and computational approaches for analysis of the genetic and signaling networks that orchestrate individual cancer susceptibility and tumor behavior.
Abstract: The past decade has seen great strides in our understanding of the genetic basis of human disease. Arguably, the most profound impact has been in the area of cancer genetics, where the explosion of genomic sequence and molecular profiling data has illustrated the complexity of human malignancies. In a tumor cell, dozens of different genes may be aberrant in structure or copy number, and hundreds or thousands of genes may be differentially expressed. A number of familial cancer genes with high-penetrance mutations have been identified, but the contribution of low-penetrance genetic variants or polymorphisms to the risk of sporadic cancer development remains unclear. Studies of the complex somatic genetic events that take place in the emerging cancer cell may aid the search for the more elusive germline variants that confer increased susceptibility. Insights into the molecular pathogenesis of cancer have provided new strategies for treatment, but a deeper understanding of this disease will require new statistical and computational approaches for analysis of the genetic and signaling networks that orchestrate individual cancer susceptibility and tumor behavior.

Journal ArticleDOI
TL;DR: In addition to outlining the extraordinary diversity of mtDNA, this review highlights the divergent trends in mitochondrial genome evolution in the various eukaryotic lineages, and examines the relationship between mitochondrial and nuclear genome evolution in a given organism.

Journal ArticleDOI
TL;DR: A novel methodology that merges genomics with classical strain improvement has been developed and applied for the reconstruction of classically derived production strains and the path from genomics to biotechnological processes is presented.
Abstract: Corynebacterium glutamicum has played a principal role in the progress of the amino acid fermentation industry. The complete genome sequence of the representative wild-type strain of C. glutamicum, ATCC 13032, has been determined and analyzed to improve our understanding of the molecular biology and physiology of this organism, and to advance the development of more efficient production strains. Genome annotation has helped in elucidation of the gene repertoire defining a desired pathway, which is accelerating pathway engineering. Post genome technologies such as DNA arrays and proteomics are currently undergoing rapid development in C. glutamicum. Such progress has already exposed new regulatory networks and functions that had so far been unidentified in this microbe. The next goal of these studies is to integrate the fruits of genomics into strain development technology. A novel methodology that merges genomics with classical strain improvement has been developed and applied for the reconstruction of classically derived production strains. How can traditional fermentation benefit from the C. glutamicum genomic data? The path from genomics to biotechnological processes is presented.

Journal ArticleDOI
TL;DR: An analysis of 9,434 orthologous genes in human and mouse indicates that alternative splicing is associated with a large increase in frequency of recent exon creation and/or loss.
Abstract: One of the most interesting opportunities in comparative genomics is to compare not only genome sequences but additional phenomena, such as alternative splicing, using orthologous genes in different genomes to find similarities and differences between organisms. Recently, genomics studies have suggested that 40-60% of human genes are alternatively spliced and have catalogued up to 30,000 alternative splice relationships in human genes. Here we report an analysis of 9,434 orthologous genes in human and mouse, which indicates that alternative splicing is associated with a large increase in frequency of recent exon creation and/or loss. Whereas most exons in the mouse and human genomes are strongly conserved in both genomes, exons that are only included in alternative splice forms (as opposed to the constitutive or major transcript form) are mostly not conserved and thus are the product of recent exon creation or loss events. A similar comparison of orthologous exons in rat and human validates this pattern. Although this says nothing about the complex question of adaptive benefit, it does indicate that alternative splicing in these genomes has been associated with increased evolutionary change.

Journal ArticleDOI
TL;DR: Prophages constitute in many bacteria a substantial part of laterally acquired DNA and contribute lysogenic conversion genes that are of selective advantage to the bacterial host.

Journal ArticleDOI
TL;DR: The analysis indicates that single-copy orthologous genes are resistant to horizontal transfer, even in ancient bacterial groups subject to high rates of LGT, thus establishing a foundation for reconstructing the evolutionary transitions that underlie diversity in genome content and organization.
Abstract: The rapid increase in published genomic sequences for bacteria presents the first opportunity to reconstruct evolutionary events on the scale of entire genomes. However, extensive lateral gene transfer (LGT) may thwart this goal by preventing the establishment of organismal relationships based on individual gene phylogenies. The group for which cases of LGT are most frequently documented and for which the greatest density of complete genome sequences is available is the γ-Proteobacteria, an ecologically diverse and ancient group including free-living species as well as pathogens and intracellular symbionts of plants and animals. We propose an approach to multigene phylogeny using complete genomes and apply it to the case of the γ-Proteobacteria. We first applied stringent criteria to identify a set of likely gene orthologs and then tested the compatibilities of the resulting protein alignments with several phylogenetic hypotheses. Our results demonstrate phylogenetic concordance among virtually all (203 of 205) of the selected gene families, with each of the exceptions consistent with a single LGT event. The concatenated sequences of the concordant families yield a fully resolved phylogeny. This topology also received strong support in analyses aimed at excluding effects of heterogeneity in nucleotide base composition across lineages. Our analysis indicates that single-copy orthologous genes are resistant to horizontal transfer, even in ancient bacterial groups subject to high rates of LGT. This gene set can be identified and used to yield robust hypotheses for organismal phylogenies, thus establishing a foundation for reconstructing the evolutionary transitions, such as gene transfer, that underlie diversity in genome content and organization.

Journal ArticleDOI
TL;DR: To increase the reliability of gene function annotation, multiple independent datasets need to be integrated. Recently developed strategies for such integration are reviewed, and it is argued that these will be important for a systems approach to modular biology.

Journal ArticleDOI
TL;DR: A rice genome view of homologous wheat genome locations based on comparative sequence analysis revealed numerous chromosomal rearrangements that will significantly complicate the use of rice as a model for cross-species transfer of information in nonconserved regions.
Abstract: The use of DNA sequence-based comparative genomics for evolutionary studies and for transferring information from model species to crop species has revolutionized molecular genetics and crop improvement strategies. This study used BLAST to compare 4485 expressed sequence tags (ESTs) that were physically mapped in wheat chromosome bins with the public rice genome sequence data from 2251 ordered BAC/PAC clones. A rice genome view of homologous wheat genome locations based on comparative sequence analysis revealed numerous chromosomal rearrangements that will significantly complicate the use of rice as a model for cross-species transfer of information in nonconserved regions.

Journal ArticleDOI
Dario Leister1
TL;DR: Recent advances in transcriptomics and proteomics of the chloroplast make this organelle one of the best understood of all plant cell compartments.

Journal ArticleDOI
TL;DR: mreps is a software tool for fast identification of tandemly repeated structures in DNA sequences; it can identify all types of tandem repeats within a single run on a whole genomic sequence.
Abstract: The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes; it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce the main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.
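To make the notion of a tandem repeat concrete, here is a naive sketch that finds exact tandem repeats only. It is not the mreps algorithm and does not handle the 'fuzzy' repeats controlled by the resolution parameter. The function name and the parameters `max_period` and `min_copies` are assumptions made for illustration.

```python
from typing import List, Tuple


def find_tandem_repeats(
    seq: str, max_period: int = 6, min_copies: int = 3
) -> List[Tuple[int, int, int]]:
    """Return (start, period, copies) for maximal exact tandem repeats.

    Naive illustration, not the mreps algorithm. Repeats whose unit is
    itself a repetition of a shorter unit (e.g. "AA") are skipped, because
    they are already reported at the shorter period.
    """
    hits = []
    n = len(seq)
    for period in range(1, max_period + 1):
        i = 0
        while i + period <= n:
            # Extend the run while each base equals the base one period back.
            j = i + period
            while j < n and seq[j] == seq[j - period]:
                j += 1
            copies = (j - i) // period
            if copies >= min_copies:
                unit = seq[i:i + period]
                # A unit is primitive if it first reoccurs in unit+unit at offset `period`.
                if (unit + unit).find(unit, 1) == period:
                    hits.append((i, period, copies))
                # Resume at the earliest position where a new run could start.
                i = j - period + 1
            else:
                i += 1
    return sorted(hits)
```

For example, `find_tandem_repeats("GGATATATATCCC", max_period=3)` returns `[(2, 2, 4), (10, 1, 3)]`: four copies of `AT` starting at index 2 and three copies of `C` starting at index 10.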

Journal ArticleDOI
TL;DR: A comparative study of large datasets of expression profiles from six evolutionarily distant organisms finds that for all organisms the connectivity distribution follows a power-law, highly connected genes tend to be essential and conserved, and the expression program is highly modular.
Abstract: Comparing genomic properties of different organisms is of fundamental importance in the study of biological and evolutionary principles. Although differences among organisms are often attributed to differential gene expression, genome-wide comparative analysis thus far has been based primarily on genomic sequence information. We present a comparative study of large datasets of expression profiles from six evolutionarily distant organisms: S. cerevisiae, C. elegans, E. coli, A. thaliana, D. melanogaster, and H. sapiens. We use genomic sequence information to connect these data and compare global and modular properties of the transcription programs. Linking genes whose expression profiles are similar, we find that for all organisms the connectivity distribution follows a power-law, highly connected genes tend to be essential and conserved, and the expression program is highly modular. We reveal the modular structure by decomposing each set of expression data into coexpressed modules. Functionally related sets of genes are frequently coexpressed in multiple organisms. Yet their relative importance to the transcription program and their regulatory relationships vary among organisms. Our results demonstrate the potential of combining sequence and expression data for improving functional gene annotation and expanding our understanding of how gene expression and diversity evolved.
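The idea of linking genes with similar expression profiles and then examining the connectivity distribution can be sketched as follows. The abstract does not state the similarity measure or the cutoff, so the Pearson correlation and the `threshold` value used here are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpression_degrees(
    profiles: Dict[str, Sequence[float]], threshold: float = 0.8
) -> Dict[str, int]:
    """Link gene pairs whose correlation is >= threshold; return each gene's degree."""
    genes = list(profiles)
    degree = {gene: 0 for gene in genes}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            if pearson(profiles[gene], profiles[other]) >= threshold:
                degree[gene] += 1
                degree[other] += 1
    return degree


def degree_distribution(degree: Dict[str, int]) -> Dict[int, float]:
    """Fraction of genes having each connectivity value."""
    counts = Counter(degree.values())
    total = len(degree)
    return {k: counts[k] / total for k in sorted(counts)}
```

On a real dataset, plotting the resulting distribution on log-log axes is one common way to check for the power-law behaviour the abstract reports. The pairwise loop is quadratic in the number of genes, so it is suited only to small illustrations.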

Journal ArticleDOI
TL;DR: Two strategies for MCS identification are reported, demonstrating their ability to detect virtually all known actively conserved sequences but very little neutrally evolving sequence (specifically, ancestral repeats).
Abstract: A key component of genomics research beyond the Human Genome Project will be the rigorous interpretation of the recently finished human genome sequence (Collins et al. 2003). Central to these efforts will be the identification of all functional elements in the human genome. Recent comparative analyses of the human and mouse genome sequences suggest that ∼5% of the mammalian genome is under active selection and thus likely serves a functional role (International Mouse Genome Sequencing Consortium 2002; Roskin et al. 2003). Within this functional subset is an estimated 1% to 2% of the genome that encodes protein (International Mouse Genome Sequencing Consortium 2002). The prospects for comprehensive identification of these coding sequences are quite good, especially in light of the availability of data sets that are complementary to the genomic sequence (e.g., ESTs [Boguski et al. 1994; also see http://www.ncbi.nlm.nih.gov/dbEST] and full-length cDNA sequences [Strausberg et al. 2002; also see http://mgc.nci.nih.gov]) and ever-improving computational methods for gene prediction (Kulp et al. 1996; Burge and Karlin 1997; Rogic et al. 2001; Solovyev 2001; Flicek et al. 2003). The complete identification and characterization of the remaining 3% to 4% of the mammalian genome that likely corresponds to functional non-coding sequence will be profoundly more challenging, due to the lack of complementary data sets, the absence of robust tools for computational predictions, and the incomplete insight about the nature of such sequence. In short, the generation of a comprehensive “parts list” of functional elements in the human genome remains an immense and important challenge. The comparison of orthologous genomic sequences has emerged as a powerful approach for identifying functional elements in the genome (Dermitzakis et al. 2002; DeSilva et al. 2002). 
The premise of this approach is that sequences conserved across millions of years of evolution are likely to have a functional role (Pennacchio and Rubin 2001). Comparative sequence analyses have been shown to facilitate the identification of both coding (Batzoglou et al. 2000; Korf et al. 2001; Pennacchio et al. 2001; Alexandersson et al. 2003; Flicek et al. 2003) and functional non-coding (Stojanovic et al. 1999; Dubchak et al. 2000; Gottgens et al. 2000; Loots et al. 2000, 2002; Wasserman et al. 2000; Dehal et al. 2001; Elnitski et al. 2003; Kellis et al. 2003) sequences. Among the latter are elements that regulate the spatial and temporal patterns of gene expression (Hardison 2000). When the generation of alignments between related sequences is not possible, motif-finding techniques have also been used to identify functional sequences, in particular for detecting transcription factor–binding sites (Bailey and Elkan 1995; Roth et al. 1998; Hertz and Stormo 1999; McCue et al. 2001; Blanchette and Tompa 2002). Recent efforts have produced whole-genome sequences for several vertebrates, including human (International Human Genome Sequencing Consortium 2001), mouse (International Mouse Genome Sequencing Consortium 2002), rat (http://genome.ucsc.edu/cgi-bin/hgGateway?org=rat), and pufferfish (Aparicio et al. 2002), with the sequencing of additional vertebrate genomes well underway. Increasingly, methods for visualizing (Kent et al. 2002; Clamp et al. 2003; Karolchik et al. 2003) and comparing (Stojanovic et al. 1999; Mayor et al. 2000; Blanchette and Tompa 2002; Loots et al. 2002; Giardine et al. 2003; Schwartz et al. 2003a) genomic sequences from multiple species are emerging. As a complement to these efforts, we are generating the sequence of targeted genomic regions in multiple, phylogenetically diverse vertebrates (Thomas et al. 2003) and developing computational approaches for identifying the subset of sequences that confers function. 
In particular, we have focused on developing algorithms for detecting sequences that are highly conserved across multiple species, which we call Multi-species Conserved Sequences (or MCSs); such sequences represent candidates for being functionally important. Here we report the development and testing of methods for MCS detection, including analyses of MCSs identified using a recently generated set of orthologous sequences from 11 non-human vertebrates (Thomas et al. 2003).

Journal ArticleDOI
TL;DR: The strong relationship with alpha-proteobacterial genes observed for some mitochondrial genes, combined with the lack of such a relationship for others, indicates that the modern mitochondrial proteome is the product of both reductive and expansive processes.
Abstract: The availability of complete genome sequence data from both bacteria and eukaryotes provides information about the contribution of bacterial genes to the origin and evolution of mitochondria. Phylogenetic analyses based on genes located in the mitochondrial genome indicate that these genes originated from within the alpha-proteobacteria. A number of ancestral bacterial genes have also been transferred from the mitochondrial to the nuclear genome, as evidenced by the presence of orthologous genes in the mitochondrial genome in some species and in the nuclear genome of other species. However, a multitude of mitochondrial proteins encoded in the nucleus display no homology to bacterial proteins, indicating that these originated within the eukaryotic cell subsequent to the acquisition of the endosymbiont. An analysis of the expression patterns of yeast nuclear genes coding for mitochondrial proteins has shown that genes predicted to be of eukaryotic origin are mainly translated on polysomes that are free in the cytosol whereas those of putative bacterial origin are translated on polysomes attached to the mitochondrion. The strong relationship with alpha-proteobacterial genes observed for some mitochondrial genes, combined with the lack of such a relationship for others, indicates that the modern mitochondrial proteome is the product of both reductive and expansive processes.

Journal ArticleDOI
TL;DR: It is shown that selective growth conditions can induce the expression of gene clusters involved in natural-product biosynthesis, suggesting that the range of enediyne natural products may be much greater than previously thought.
Abstract: Genome analysis of actinomycetes has revealed the presence of numerous cryptic gene clusters encoding putative natural products. These loci remain dormant until appropriate chemical or physical signals induce their expression. Here we demonstrate the use of a high-throughput genome scanning method to detect and analyze gene clusters involved in natural-product biosynthesis. This method was applied to uncover biosynthetic pathways encoding enediyne antitumor antibiotics in a variety of actinomycetes. Comparative analysis of five biosynthetic loci representative of the major structural classes of enediynes reveals the presence of a conserved cassette of five genes that includes a novel family of polyketide synthase (PKS). The enediyne PKS (PKSE) is proposed to be involved in the formation of the highly reactive chromophore ring structure (or "warhead") found in all enediynes. Genome scanning analysis indicates that the enediyne warhead cassette is widely dispersed among actinomycetes. We show that selective growth conditions can induce the expression of these loci, suggesting that the range of enediyne natural products may be much greater than previously thought. This technology can be used to increase the scope and diversity of natural-product discovery.