scispace - formally typeset
Search or ask a question

Showing papers by "Roderic Guigó published in 2001"


Journal ArticleDOI
J. Craig Venter1, Mark Raymond Adams1, Eugene W. Myers1, Peter W. Li1  +269 moreInstitutions (12)
16 Feb 2001-Science
TL;DR: Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems are indicated.
Abstract: A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies-a whole-genome assembly and a regional chromosome assembly-were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

12,098 citations


Journal ArticleDOI
TL;DR: A method to predict selenoproteins in genomic sequences, which relies on the prediction of SECIS elements in coordination with the Prediction of genes in which the strong codon bias characteristic of protein coding regions extends beyond a TGA codon interrupting the open reading frame is described.
Abstract: In selenoproteins, incorporation of the amino acid selenocysteine is specified by the UGA codon, usually a stop signal. The alternative decoding of UGA is conferred by an mRNA structure, the SECIS element, located in the 3′-untranslated region of the selenoprotein mRNA. Because of the non-standard use of the UGA codon, current computational gene prediction methods are unable to identify selenoproteins in the sequence of the eukaryotic genomes. Here we describe a method to predict selenoproteins in genomic sequences, which relies on the prediction of SECIS elements in coordination with the prediction of genes in which the strong codon bias characteristic of protein coding regions extends beyond a TGA codon interrupting the open reading frame. We applied the method to the Drosophila melanogaster genome, and predicted four potential selenoprotein genes. One of them belongs to a known family of selenoproteins, and we have tested experimentally two other predictions with positive results. Finally, we have characterized the expression pattern of these two novel selenoprotein genes.

119 citations


Journal ArticleDOI
TL;DR: A program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences, which may be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training.
Abstract: Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.

103 citations


Journal ArticleDOI
TL;DR: Investigating the degree to which gene structure (intron position, phase and length) is conserved between homologous protein domains finds that for certain extracellular‐signalling and nuclear domains, gene structures are similar even when pro tein sequence similarity is low or not significant and sequences can only be aligned with a knowledge of protein tertiary structure.
Abstract: The evolutionary significance of introns remains a mystery. The current availability of several complete eukaryotic genomes permits new studies to probe the possible function of these peculiar genomic features. Here we investigate the degree to which gene structure (intron position, phase and length) is conserved between homologous protein domains. We find that for certain extracellular-signalling and nuclear domains, gene structures are similar even when pro tein sequence similarity is low or not significant and sequences can only be aligned with a knowledge of protein tertiary structure. In contrast, other domains, including most intracellular signalling modules, show little gene structure conservation. Intriguingly, many domains with conserved gene structures, such as cytokines, are involved in similar biological processes, such as the immune response. This suggests that gene structure conservation may be a record of key events in evolution, such as the origin of the vertebrate immune system or the duplication of nuclear receptors in nematodes. The results suggest ways to detect new and potentially very remote homologues, and to construct phylogenies for proteins with limited sequence similarity.

101 citations