Showing papers in "Genome Research in 2000"


Journal ArticleDOI
TL;DR: The results suggest that strand-slippage theories alone are insufficient to explain microsatellite distribution in the genome as a whole and that taxon-specific variation could also be detected in the frequency distributions of simple sequence motifs.
Abstract: We examined the abundance of microsatellites with repeated unit lengths of 1-6 base pairs in several eukaryotic taxonomic groups: primates, rodents, other mammals, nonmammalian vertebrates, arthropods, Caenorhabditis elegans, plants, yeast, and other fungi. The distribution of simple sequence repeats was compared between exons, introns, and intergenic regions. Tri- and hexanucleotide repeats prevail in protein-coding exons of all taxa, whereas the dependence of repeat abundance on the length of the repeated unit shows a very different pattern, as well as taxon-specific variation, in intergenic regions and introns. Although coding and noncoding regions are known to differ significantly in their microsatellite distribution, we could additionally demonstrate characteristic differences between intergenic regions and introns. We observed a striking relative abundance of (CCG)n·(CGG)n trinucleotide repeats in intergenic regions of all vertebrates, in contrast to the almost complete absence of this motif from introns. Taxon-specific variation could also be detected in the frequency distributions of simple sequence motifs. Our results suggest that strand-slippage theories alone are insufficient to explain microsatellite distribution in the genome as a whole. Other possible factors contributing to the observed divergence are discussed.
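
The repeat tallies such a survey rests on can be approximated with a regular-expression scan. A minimal sketch, not the authors' pipeline; the toy sequence and the three-copy threshold are illustrative, not the study's criteria:

import re
from collections import Counter

def count_microsatellites(seq, min_repeats=3, max_unit=6):
    """Tally perfect tandem repeats with unit lengths of 1-6 bp.

    Counts one hit per repeat locus, keyed by unit length. Note that a
    mononucleotide run also matches as a di- or trinucleotide repeat;
    real analyses collapse such redundancy and apply length thresholds
    specific to each unit size.
    """
    counts = Counter()
    for unit in range(1, max_unit + 1):
        # a unit-length group followed by at least (min_repeats - 1) copies of itself
        pattern = re.compile(r"(.{%d})\1{%d,}" % (unit, min_repeats - 1))
        for match in pattern.finditer(seq):
            counts[unit] += 1
    return counts

toy = "ATATATATATGGCGGCGGCGGCTTTTTTTAACCGCCGCCG"
print(count_microsatellites(toy))   # maps unit length -> number of repeat loci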

1,391 citations


Journal ArticleDOI
TL;DR: The results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives, although exact gene prediction, especially at the gene level, needs further improvement.
Abstract: Ab initio gene identification in the genomic sequence of Drosophila melanogaster was obtained using the Fgenes (human gene predictor) and Fgenesh programs, which have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use information about cDNA/EST in most predictions to model a real situation for finding new genes, because information about complete cDNA is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction on different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes whose protein products have clear homologs in protein databases, predictions were recomputed by the Fgenesh+ program. The first annotation serves as the optimal computational description of a new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting PCR primers for experimental verification of gene structure. Our results show that we can identify approximately 90% of coding nucleotides with 20% false positives. At the exon level, we accurately predicted 65% of exons (89% including overlapping exons) with 49% false positives. Optimizing prediction accuracy, we designed a gene identification scheme using Fgenesh that provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level, and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on the std1 set and specificity on the std3 set). In general, these results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. However, exact gene prediction (especially at the gene level) needs further improvement of gene prediction algorithms. The program was also tested for predicting genes of human Chromosome 22 (the latest version of Fgenesh can analyze a whole chromosome sequence). This analysis demonstrated that 88% of manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.
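
The base-level figures above follow the standard gene-finding conventions: sensitivity Sn = TP/(TP+FN) and specificity Sp = TP/(TP+FP), the latter being what other fields call precision. A toy check with invented counts:

def sn_sp(tp, fn, fp):
    """Base-level sensitivity and specificity as used in gene-finding benchmarks."""
    return tp / (tp + fn), tp / (tp + fp)

# Hypothetical tally: 9000 coding bases found, 1000 missed,
# 1400 noncoding bases wrongly called coding.
sn, sp = sn_sp(tp=9000, fn=1000, fp=1400)
print(f"Sn = {sn:.0%}, Sp = {sp:.0%}")   # Sn = 90%, Sp = 87%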

1,163 citations


Journal ArticleDOI
TL;DR: PipMaker is appropriate for comparing genomic sequences from any two related species, although the types of information that can be inferred depend on the level of conservation and the time and divergence rate since the separation of the species.
Abstract: PipMaker (http://bio.cse.psu.edu) is a World-Wide Web site for comparing two long DNA sequences to identify conserved segments and for producing informative, high-resolution displays of the resulting alignments. One display is a percent identity plot (pip), which shows both the position in one sequence and the degree of similarity for each aligning segment between the two sequences in a compact and easily understandable form. Positions along the horizontal axis can be labeled with features such as exons of genes and repetitive elements, and colors can be used to clarify and enhance the display. The web site also provides a plot of the locations of those segments in both species (similar to a dot plot). PipMaker is appropriate for comparing genomic sequences from any two related species, although the types of information that can be inferred (e.g., protein-coding regions and cis-regulatory elements) depend on the level of conservation and the time and divergence rate since the separation of the species. Gene regulatory elements are often detectable as similar, noncoding sequences in species that diverged as much as 100-300 million years ago, such as humans and mice, Caenorhabditis elegans and C. briggsae, or Escherichia coli and Salmonella spp. PipMaker supports analysis of unfinished or "working draft" sequences by permitting one of the two sequences to be in unoriented and unordered contigs.
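
Each dash in a pip is an aligning segment plotted at its position in the first sequence against its percent identity. A minimal sketch of that bookkeeping for gap-free segments, not PipMaker's actual alignment engine; the segments are invented:

def percent_identity(seg1, seg2):
    """Percent identity of two equal-length, gap-free aligned segments."""
    matches = sum(a == b for a, b in zip(seg1, seg2))
    return 100.0 * matches / len(seg1)

# (start position in sequence 1, aligned segment from seq 1, from seq 2)
segments = [(1200, "ACGTACGTAA", "ACGTACGTCA"),
            (5400, "TTGACCTTGA", "TTGACCTAGA")]
for start, s1, s2 in segments:
    print(f"x = {start}, y = {percent_identity(s1, s2):.0f}%")   # one dash per segment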

1,127 citations


Journal ArticleDOI
TL;DR: A method called recombinational cloning is described that uses in vitro site-specific recombination to accomplish the directional cloning of PCR products and the subsequent automatic subcloning of the DNA segment into new vector backbones at high efficiency.
Abstract: As a result of numerous genome sequencing projects, large numbers of candidate open reading frames are being identified, many of which have no known function. Analysis of these genes typically involves the transfer of DNA segments into a variety of vector backgrounds for protein expression and functional analysis. We describe a method called recombinational cloning that uses in vitro site-specific recombination to accomplish the directional cloning of PCR products and the subsequent automatic subcloning of the DNA segment into new vector backbones at high efficiency. Numerous DNA segments can be transferred in parallel into many different vector backgrounds, providing an approach to high-throughput, in-depth functional analysis of genes and rapid optimization of protein expression. The resulting subclones maintain orientation and reading frame register, allowing amino- and carboxy-terminal translation fusions to be generated. In this paper, we outline the concepts of this approach and provide several examples that highlight some of its potential.

1,085 citations



Journal ArticleDOI
TL;DR: Comparative analysis suggests that an excess of chromosome fissions in the tetrapod lineage may account for chromosome numbers and provides histories for several human chromosomes.
Abstract: To help understand mechanisms of vertebrate genome evolution, we have compared zebrafish and tetrapod gene maps. It has been suggested that translocations are fixed more frequently than inversions in mammals. Gene maps showed that blocks of conserved syntenies between zebrafish and humans were large, but gene orders were frequently inverted and transposed. This shows that intrachromosomal rearrangements have been fixed more frequently than translocations. Duplicated chromosome segments suggest that a genome duplication occurred in ray-fin phylogeny, and comparative studies suggest that this event happened deep in the ancestry of teleost fish. Consideration of duplicate chromosome segments shows that at least 20% of duplicated gene pairs may be retained from this event. Despite genome duplication, zebrafish and humans have about the same number of chromosomes, and zebrafish chromosomes are mosaically orthologous to several human chromosomes. Is this because of an excess of chromosome fissions in the human lineage or an excess of chromosome fusions in the zebrafish lineage? Comparative analysis suggests that an excess of chromosome fissions in the tetrapod lineage may account for chromosome numbers and provides histories for several human chromosomes.

673 citations


Journal ArticleDOI
TL;DR: The average number of ESTs associated with each signal type suggests that variant signals (including the common AUUAAA) are processed less efficiently than the canonical signal and could therefore be selected for regulatory purposes.
Abstract: The formation of mature mRNAs in vertebrates involves the cleavage and polyadenylation of the pre-mRNA, 10-30 nt downstream of an AAUAAA or AUUAAA signal sequence. The extensive cDNA data now available shows that these hexamers are not strictly conserved. In order to identify variant polyadenylation signals on a large scale, we compared over 8700 human 3' untranslated sequences to 157,775 polyadenylated expressed sequence tags (ESTs), used as markers of actual mRNA 3' ends. About 5600 EST-supported putative mRNA 3' ends were collected and analyzed for significant hexameric sequences. Known polyadenylation signals were found in only 73% of the 3' fragments. Ten single-base variants of the AAUAAA sequence were identified with a highly significant occurrence rate, potentially representing 14.9% of the actual polyadenylation signals. Of the mRNAs, 28.6% displayed two or more polyadenylation sites. In these mRNAs, the poly(A) sites proximal to the coding sequence tend to use variant signals more often, while the 3'-most site tends to use a canonical signal. The average number of ESTs associated with each signal type suggests that variant signals (including the common AUUAAA) are processed less efficiently than the canonical signal and could therefore be selected for regulatory purposes. However, the position of the site in the untranslated region may also play a role in polyadenylation rate.
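
The hexamer search itself is mechanical: enumerate the 18 single-base variants of AAUAAA and scan the region upstream of each EST-supported 3' end. A sketch under simplified assumptions (a fixed 50-nt window and clean sequence), not the paper's statistical procedure:

def single_base_variants(canonical="AAUAAA", alphabet="ACGU"):
    """The 18 hexamers differing from the canonical signal at exactly one position."""
    variants = set()
    for i, ref in enumerate(canonical):
        for base in alphabet:
            if base != ref:
                variants.add(canonical[:i] + base + canonical[i + 1:])
    return variants

def find_signals(utr3, window=50):
    """Report (position, hexamer, class) hits in the 3'-most window of a UTR."""
    variants = single_base_variants()
    tail = utr3[-window:]
    offset = len(utr3) - len(tail)
    hits = []
    for i in range(len(tail) - 5):
        hexamer = tail[i:i + 6]
        if hexamer == "AAUAAA":
            hits.append((offset + i, hexamer, "canonical"))
        elif hexamer in variants:
            hits.append((offset + i, hexamer, "variant"))
    return hits

utr = "GCCUUC" * 10 + "AUUAAAGCUGCCUUGUUCCAAA"   # toy 3' UTR
print(find_signals(utr))   # [(60, 'AUUAAA', 'variant')]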

650 citations


Journal ArticleDOI
TL;DR: High-throughput gene and EST mapping projects in zebrafish provide a syntenic relationship to the human genome for the majority of the zebrafish genome.
Abstract: The zebrafish is an important vertebrate model for the mutational analysis of genes affecting developmental processes. Relating zebrafish genes and mutations to those of humans will require understanding the syntenic correspondence between the zebrafish and human genomes. High-throughput gene and EST mapping projects in zebrafish are now facilitating this goal. Map positions for 523 zebrafish genes and ESTs with predicted human orthologs reveal extensive contiguous blocks of synteny between the zebrafish and human genomes. Eighty percent of the genes and ESTs analyzed belong to conserved synteny groups (two or more genes linked in both zebrafish and human), and 56% of all genes analyzed fall in 118 homology segments (uninterrupted segments containing two or more contiguous genes or ESTs with conserved map order between the zebrafish and human genomes). This work now provides a syntenic relationship to the human genome for the majority of the zebrafish genome.
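
The conserved-synteny-group criterion (two or more genes linked in both species) reduces to grouping mapped genes by their chromosome pair. A minimal sketch; the gene placements below are hypothetical, not taken from the paper:

from collections import defaultdict

def conserved_syntenies(assignments, min_genes=2):
    """Find (zebrafish linkage group, human chromosome) pairs supported by
    at least min_genes mapped genes.

    assignments: gene -> (zebrafish_lg, human_chr).
    """
    groups = defaultdict(list)
    for gene, pair in assignments.items():
        groups[pair].append(gene)
    return {pair: genes for pair, genes in groups.items() if len(genes) >= min_genes}

demo = {"pax2a": (13, "10"), "fgf8a": (13, "10"),   # hypothetical placements
        "shha": (7, "7"), "eng2b": (7, "7"),
        "gh1": (3, "17")}
print(conserved_syntenies(demo))
# {(13, '10'): ['pax2a', 'fgf8a'], (7, '7'): ['shha', 'eng2b']}

Homology segments add the further requirement that gene order be conserved within each such group.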

605 citations


Journal ArticleDOI
TL;DR: The goals of this work are to provide a carefully curated alignment of serpin sequences, to describe patterns of conservation and divergence, and to derive a phylogenetic tree expressing the relationships among the members of this family.
Abstract: We present a comprehensive alignment and phylogenetic analysis of the serpins, a superfamily of proteins with known members in higher animals, nematodes, insects, plants, and viruses. We analyze, compare, and classify 219 proteins representative of eight major and eight minor subfamilies, using a novel technique of consensus analysis. Patterns of sequence conservation characterize the family as a whole, with a clear relationship to the mechanism of function. Variations of these patterns within phylogenetically distinct groups can be correlated with the divergence of structure and function. The goals of this work are to provide a carefully curated alignment of serpin sequences, to describe patterns of conservation and divergence, and to derive a phylogenetic tree expressing the relationships among the members of this family. We extend earlier studies by Huber and Carrell as well as by Marshall, after whose publication the serpin family has grown functionally, taxonomically, and structurally. We used gene and protein sequence data, crystal structures, and chromosomal location where available. The results illuminate structure-function relationships in serpins, suggesting roles for conserved residues in the mechanism of conformational change. The phylogeny provides a rational evolutionary framework to classify serpins and enables identification of conserved amino acids. Patterns of conservation also provide an initial point of comparison for genes identified by the various genome projects. New homologs emerging from sequencing projects can either take their place within the current classification or, if necessary, extend it.

583 citations


Journal ArticleDOI
TL;DR: Qualitatively, it is observed that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome are more stringent, while the fraction of potential false positives decreases.
Abstract: Various new methods have been proposed to predict functional interactions between proteins based on the genomic context of their genes. The types of genomic context that they use are Type I: the fusion of genes; Type II: the conservation of gene order or co-occurrence of genes in potential operons; and Type III: the co-occurrence of genes across genomes (phylogenetic profiles). Here we compare these types for their coverage, their correlations with various types of functional interaction, and their overlap with homology-based function assignment. We apply the methods to Mycoplasma genitalium, the standard benchmarking genome in computational and experimental genomics. Quantitatively, conservation of gene order is the technique with the highest coverage, applying to 37% of the genes. By combining gene order conservation with gene fusion (6%), the co-occurrence of genes in operons in the absence of gene order conservation (8%), and the co-occurrence of genes across genomes (11%), significant context information can be obtained for 50% of the genes (the categories overlap). Qualitatively, we observe that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome become more stringent, while the fraction of potential false positives decreases. Moreover, only in cases in which gene order is conserved in a substantial fraction of the genomes, in this case six out of twenty-five, does a single type of functional interaction (physical interaction) clearly dominate (>80%). In other cases, complementary function information from homology searches, which is available for most of the genes with significant genomic context, is essential to predict the type of interaction. Using a combination of genomic context and homology searches, new functional features can be predicted for 10% of M. genitalium genes.
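
Of the three context types, phylogenetic profiles (Type III) are the easiest to sketch: genes that are always present or absent together across genomes are candidates for functional interaction. A toy bit-vector version, not the paper's scored method:

def profile_pairs(presence):
    """Group genes sharing an identical presence/absence profile across genomes.

    presence: gene -> tuple of 0/1 flags, one per reference genome.
    Identical profiles are the simplest criterion; scoring methods also
    handle near-matches and correct for very common profiles.
    """
    by_profile = {}
    for gene, profile in presence.items():
        by_profile.setdefault(profile, []).append(gene)
    return [genes for genes in by_profile.values() if len(genes) > 1]

profiles = {"trpA": (1, 1, 0, 1, 0),
            "trpB": (1, 1, 0, 1, 0),   # co-occurs with trpA -> predicted interaction
            "recA": (1, 1, 1, 1, 1)}
print(profile_pairs(profiles))   # [['trpA', 'trpB']]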

500 citations


Journal ArticleDOI
TL;DR: The findings suggest that the differences between coding and noncoding microsatellite frequencies arise from specific selection against frameshift mutations in coding regions resulting from length changes in nontriplet repeats.
Abstract: Microsatellite enrichment is an excess of repetitive sequences characteristic of all studied eukaryotes. It is thought to result from the accumulated effects of replication slippage mutations. Enrichment is commonly measured as the ratio of the observed frequency of microsatellites to the frequency expected to result from random association of nucleotides. We have compared enrichment of specific types of microsatellites in coding sequences with those in noncoding sequences across seven eukaryotic clades. The results reveal consistent differences between coding and noncoding regions, in terms of both the quantity of repetitive DNA and the types present. In noncoding regions, all types of microsatellite (mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats) are found in excess, and in all cases, these excesses scale in a similar exponential fashion with the length of the microsatellite. This suggests that all types of noncoding repeats are subject to similar mutational and selective processes. Coding repeats, however, appear to be under much stronger and more specific constraints. Tri- and hexanucleotide repeats are found in consistent and significant excess over a wide range of lengths in both coding and noncoding sequences, but other repeat types are much less frequent in coding regions than in noncoding regions. These findings suggest that the differences between coding and noncoding microsatellite frequencies arise from specific selection against frameshift mutations in coding regions resulting from length changes in nontriplet repeats. Furthermore, the excesses of tri- and hexanucleotide coding repeats appear to be controlled primarily by mutation pressure.
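
Under the random-association null model, the expected frequency of a specific repeat is just the product of its base probabilities, and enrichment is the observed/expected ratio. A worked sketch with invented numbers:

def expected_repeat_frequency(unit, n_copies, base_freqs):
    """Probability that a fixed position begins n_copies perfect tandem
    copies of unit, assuming independently drawn bases (the random-association null)."""
    p = 1.0
    for base in unit * n_copies:
        p *= base_freqs[base]
    return p

base_freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}      # illustrative composition
expected = expected_repeat_frequency("AC", 6, base_freqs)   # (AC)6 = 12 bp
observed = 2.4e-7   # hypothetical per-position rate from a genome scan
print(f"expected = {expected:.1e}, enrichment = {observed / expected:.1f}x")   # ~5x excess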

Journal ArticleDOI
TL;DR: The general accessibility, simple set-up, and the robust procedure of the array-based genotyping system described here will offer an easy way to increase the throughput of SNP typing in any molecular biology laboratory.
Abstract: This study describes a practical system that allows high-throughput genotyping of single nucleotide polymorphisms (SNPs) and detection of mutations by allele-specific extension on primer arrays. The method relies on the sequence-specific extension of two immobilized allele-specific primers that differ at their 3′-nucleotide defining the alleles, by a reverse transcriptase (RT) enzyme at optimized reaction conditions. We show the potential of this simple one-step procedure performed on spotted primer arrays of low redundancy by generating over 8000 genotypes for 40 mutations or SNPs. The genotypes formed three easily identifiable clusters and all known genotypes were assigned correctly. Higher degrees of multiplexing will be possible with this system as the power of discrimination between genotypes remained unaltered in the presence of over 100 amplicons in a single reaction. The enzyme-assisted reaction provides highly specific allele distinction, evidenced by its ability to detect minority sequence variants present in 5% of a sample at multiple sites. The assay format based on miniaturized reaction chambers at standard 384-well spacing on microscope slides carrying arrays with two primers per SNP for 80 samples results in low consumption of reagents and makes parallel analysis of a large number of samples convenient. In the assay one or two fluorescent nucleotide analogs are used as labels, and thus the genotyping results can be interpreted with presently available array scanners and software. The general accessibility, simple set-up, and the robust procedure of the array-based genotyping system described here will offer an easy way to increase the throughput of SNP typing in any molecular biology laboratory.

Journal ArticleDOI
TL;DR: The compact physical size of the chicken genome, combined with the large size of its genetic map and the observed degree of conserved synteny, makes the chicken a valuable model organism in the genomics as well as the postgenomics era.
Abstract: A consensus linkage map has been developed in the chicken that combines all of the genotyping data from the three available chicken mapping populations. Genotyping data were contributed by the laboratories that have been using the East Lansing and Compton reference populations and from the Animal Breeding and Genetics Group of the Wageningen University using the Wageningen/Euribrid population. The resulting linkage map of the chicken genome contains 1889 loci. A framework map is presented that contains 480 loci ordered on 50 linkage groups. Framework loci are defined as loci whose order relative to one another is supported by odds greater than 3. The possible positions of the remaining 1409 loci are indicated relative to these framework loci. The total map spans 3800 cM, which is considerably larger than previous estimates for the chicken genome. Furthermore, although the physical size of the chicken genome is threefold smaller than that of mammals, its genetic map is comparable in size to that of most mammals. The map contains 350 markers within expressed sequences, 235 of which represent identified genes or sequences that have significant sequence identity to known genes. This improves the contribution of the chicken linkage map to comparative gene mapping considerably and clearly shows the conservation of large syntenic regions between the human and chicken genomes. The compact physical size of the chicken genome, combined with the large size of its genetic map and the observed degree of conserved synteny, makes the chicken a valuable model organism in the genomics as well as the postgenomics era. The linkage maps, the two-point lod scores, and additional information about the loci are available at web sites in Wageningen (http://www.zod.wau.nl/vf/research/chicken/frame_chicken.html) and East Lansing (http://poultry.mph.msu.edu/).

Journal ArticleDOI
TL;DR: By providing a means for SNP genotyping up to thousands of samples simultaneously, inexpensively, and reproducibly, this method is a powerful strategy for detecting meaningful polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.
Abstract: We have developed an accurate, yet inexpensive and high-throughput, method for determining the allele frequency of biallelic polymorphisms in pools of DNA samples. The assay combines kinetic (real-time quantitative) PCR with allele-specific amplification and requires no post-PCR processing. The relative amounts of each allele in a sample are quantified by dividing equal aliquots of the pooled DNA between two separate PCR reactions, each of which contains a primer pair specific to one or the other allelic SNP variant. For pools with equal amounts of the two alleles, the two amplifications should reach a detectable level of fluorescence at the same cycle number. For pools that contain unequal ratios of the two alleles, the difference in cycle number between the two amplification reactions can be used to calculate the relative allele amounts. We demonstrate the accuracy and reliability of the assay on samples with known predetermined SNP allele frequencies from 5% to 95%, including pools of both human and mouse DNAs, using eight different SNPs altogether. The accuracy of measuring known allele frequencies is very high, with the correlation between measured and known frequencies reaching r² = 0.997. The loss of sensitivity as a result of measurement error is typically minimal, compared with that due to sampling error alone, for population samples up to 1000. We believe that by providing a means for SNP genotyping of up to thousands of samples simultaneously, inexpensively, and reproducibly, this method is a powerful strategy for detecting meaningful polymorphic differences in candidate gene association studies and genome-wide linkage disequilibrium scans.
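
The cycle-number arithmetic follows from product doubling each cycle: a template ratio r between the two alleles shifts the threshold cycle by log2(r). A sketch assuming perfect amplification efficiency, whereas the published assay calibrates against known mixtures:

def allele1_frequency(ct_allele1, ct_allele2):
    """Frequency of allele 1 from the threshold cycles of the two
    allele-specific amplifications.

    Assumes exact doubling per cycle; real assays fit efficiency empirically.
    """
    ratio = 2.0 ** (ct_allele2 - ct_allele1)   # allele1 : allele2 template amounts
    return ratio / (1.0 + ratio)

print(f"{allele1_frequency(24.0, 24.0):.0%}")   # equal Ct -> 50%
print(f"{allele1_frequency(23.0, 25.0):.0%}")   # allele 1 detected 2 cycles earlier -> 80%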

Journal ArticleDOI
TL;DR: A novel comparative proteomic approach for assembling human gene contigs and assisting gene discovery in Caenorhabditis elegans was presented, and over 150 putative full-length human gene transcripts were assembled upon further database analyses.
Abstract: Modern biomedical research greatly benefits from large-scale genome-sequencing projects ranging from studies of viruses, bacteria, and yeast to multicellular organisms, like Caenorhabditis elegans. Comparative genomic studies offer a vast array of prospects for identification and functional annotation of human ortholog genes. We present a novel comparative proteomic approach for assembling human gene contigs and assisting gene discovery. The C. elegans proteome was used as an alignment template to assist in novel human gene identification from human EST nucleotide databases. Among the available 18,452 C. elegans protein sequences, our results indicate that at least 83% (15,344 sequences) of the C. elegans proteome has human homologous genes, with 7,954 records of C. elegans proteins matching known human gene transcripts. Only 11% or less of the C. elegans proteome consists of nematode-specific genes. We found that the remaining 7,390 sequences might lead to discoveries of novel human genes, and over 150 putative full-length human gene transcripts were assembled upon further database analyses. [The sequence data described in this paper have been submitted to the

Journal ArticleDOI
TL;DR: A public gene expression data repository and online data access and analysis, WWW and FTP sites for serial analysis of gene expression (SAGE) data, and the organization and use of this resource are described.
Abstract: Gene expression quantifying techniques promise to shape our understanding of the distribution and regulation of the products of transcription in normal and abnormal cell types. cDNA microarray (DeRisi 1997), high-density oligo DNA array (Wodicka 1997) and serial analysis of gene expression (Velculescu 1995) techniques have all been developed to quickly and efficiently survey genome-wide transcript expression. However, each of these techniques has the potential to produce, in a single experiment, vast amounts of data which must be sifted and ordered for useful information to become apparent. Additional challenges are met when attempts are made to compare, merge and contrast data from experiments conducted under differing conditions and locales. As a prototype for the handling, analysis and exchange of gene expression data in the public forum, we have undertaken the production of a public repository and resource for a particular set of gene expression data, i.e., serial analysis of gene expression (SAGE) data. This repository was designed initially to archive SAGE data produced through the Cancer Genome Anatomy Project (CGAP) (Strausberg 1997; http://www.ncbi.nlm.nih.gov/cgap) but is now capable of accepting submissions of SAGE sequence data from any source, without fee or restriction on dissemination or use. It is our goal to provide free and open access to raw SAGE sequence data, precomputed tag extractions, and several modest analysis tools. This resource currently contains over two million tags from 47 SAGE libraries. We call this resource SAGEmap. Its two online components are available via the World Wide Web (http://www.ncbi.nlm.nih.gov/sage) and anonymous FTP (ftp://ncbi.nlm.nih.gov/pub/sage).
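
The precomputed tag extractions such a repository serves rest on the defining SAGE operation: the tag is the short stretch (10 bp for the original protocol) immediately 3' of the last anchoring-enzyme site (CATG for NlaIII) in each transcript. A simplified sketch; production extraction parses ditags from concatemer reads and removes duplicates:

def sage_tag(transcript, anchor="CATG", tag_len=10):
    """Extract the SAGE tag: tag_len bases following the 3'-most anchor site.

    Returns None when the transcript lacks a complete tag. Simplified from
    real pipelines, which work on sequenced ditag concatemers.
    """
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None
    tag = transcript[pos + len(anchor): pos + len(anchor) + tag_len]
    return tag if len(tag) == tag_len else None

toy_mrna = "GGGACATGAAACCCTGCATGGTTCCAGGAAATTTCCC"
print(sage_tag(toy_mrna))   # GTTCCAGGAA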

Journal ArticleDOI
TL;DR: A homozygous diploid meiotic mapping panel is used to localize polymorphisms in 691 previously unmapped genes and expressed sequence tags (ESTs) and to enhance the understanding of the evolution of the vertebrate genome.
Abstract: Zebrafish mutations define the functions of hundreds of essential genes in the vertebrate genome. To accelerate the molecular analysis of zebrafish mutations and to facilitate comparisons among the genomes of zebrafish and other vertebrates, we used a homozygous diploid meiotic mapping panel to localize polymorphisms in 691 previously unmapped genes and expressed sequence tags (ESTs). Together with earlier efforts, this work raises the total number of markers scored in the mapping panel to 2119, including 1503 genes and ESTs and 616 previously characterized simple-sequence length polymorphisms. Sequence analysis of zebrafish genes mapped in this study and in prior work identified putative human orthologs for 804 zebrafish genes and ESTs. Map comparisons revealed 139 new conserved syntenies, in which two or more genes are on the same chromosome in zebrafish and human. Although some conserved syntenies are quite large, there were changes in gene order within conserved groups, apparently reflecting the relatively frequent occurrence of inversions and other intrachromosomal rearrangements since the divergence of teleost and tetrapod ancestors. Comparative mapping also shows that there is not a one-to-one correspondence between zebrafish and human chromosomes. Mapping of duplicate gene pairs identified segments of 20 linkage groups that may have arisen during a genome duplication that occurred early in the evolution of teleosts after the divergence of teleost and mammalian ancestors. This comparative map will accelerate the molecular analysis of zebrafish mutations and enhance the understanding of the evolution of the vertebrate genome.

Journal ArticleDOI
TL;DR: A highly parallel method for genotyping single nucleotide polymorphisms (SNPs) using generic high-density oligonucleotide arrays that contain thousands of preselected 20-mer oligon nucleotide tags, which can be used for allele-frequency estimation in pooled DNA samples.
Abstract: Large-scale human genetic studies require technologies for generating millions of genotypes with relative ease but also at a reasonable cost and with high accuracy. We describe a highly parallel method for genotyping single nucleotide polymorphisms (SNPs), using generic high-density oligonucleotide arrays that contain thousands of preselected 20-mer oligonucleotide tags. First, marker-specific primers are used in PCR amplifications of genomic regions containing SNPs. Second, the amplification products are used as templates in single base extension (SBE) reactions using chimeric primers with 3' complementarity to the specific SNP loci and 5' complementarity to specific probes, or tags, synthesized on the array. The SBE primers, terminating one base before the polymorphic site, are extended in the presence of labeled dideoxy NTPs, using a different label for each of the two SNP alleles, and hybridized to the tag array. Third, genotypes are deduced from the fluorescence intensity ratio of the two colors. This approach takes advantage of multiplexed sample preparation, hybridization, and analysis at each stage. We illustrate and test this method by genotyping 44 individuals for 142 human SNPs identified previously in 62 candidate hypertension genes. Because the hybridization results are quantitative, this method can also be used for allele-frequency estimation in pooled DNA samples.
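
The genotype-deduction step reduces to classifying each SNP by the allele-A fraction of its two-color signal. A sketch with illustrative cutoffs; production pipelines derive cluster boundaries from the data rather than fixing them:

def call_genotype(signal_a, signal_b, hom_cutoff=0.8, min_total=200.0):
    """Call AA/AB/BB from the two labeled-ddNTP intensities at one SNP."""
    total = signal_a + signal_b
    if total < min_total:
        return "no call"            # too little signal to classify
    frac_a = signal_a / total
    if frac_a >= hom_cutoff:
        return "AA"
    if frac_a <= 1.0 - hom_cutoff:
        return "BB"
    return "AB"

for a, b in [(1800.0, 150.0), (900.0, 850.0), (60.0, 70.0)]:
    print(call_genotype(a, b))      # AA, AB, no call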

Journal ArticleDOI
Lynn B. Jorde
TL;DR: These techniques are summarized and the evolutionary factors that can confound or enhance disequilibrium analysis will be discussed, and some thoughts will be offered on the optimal choice of markers and populations for LD analysis.
Abstract: During the past two decades, linkage analysis has been phenomenally successful in localizing Mendelian disease genes. Linkage disequilibrium (LD) analysis, which effectively incorporates the effects of many past generations of recombination, has often been instrumental in the final phases of gene localization (Feder et al. 1996; Hastbacka et al. 1994; Kerem et al. 1989). These successes have fueled hopes that similar approaches will be effective in localizing genes underlying susceptibility to common, complex diseases. With the exception of Mendelian subsets of common diseases (e.g., BRCA1 and BRCA2 for breast cancer, APC for colon cancer, the LDL receptor gene for heart disease), progress on this front has been limited. Typically, a nonparametric linkage analysis, such as a sib-pair analysis, will implicate several genetic regions as targets for further investigation. These regions, often 10–20 Mb in size, remain intractably large for effective positional cloning. It is now hoped that LD approaches, using hundreds of thousands of new polymorphic markers, will overcome this impasse (Risch and Merikangas 1996). The rationale underlying LD mapping of complex disease genes is straightforward and similar to the justification for LD mapping of Mendelian disease genes. With both types of disease genes, the primary advantage of LD analysis remains its ability to use the effects of dozens or hundreds of past generations of recombination to achieve fine-scale gene localization (Jorde 1995). An important difficulty, common to both types of disease genes, is that past historical events (admixture, genetic drift, multiple mutations, and natural selection) can disturb the relationship between LD and inter-locus physical distance. A major difference, of course, is that locus heterogeneity complicates the analysis of complex diseases and may be more extensive for these diseases than for most Mendelian diseases. Furthermore, allelic heterogeneity may be present at each locus. This heterogeneity, the scope of which is largely unknown, will limit the strength of association between a given polymorphism and an observable phenotype. Despite these challenges, LD mapping holds considerable appeal, and there is great demand to resolve the genetics of complex diseases. Consequently, many new techniques have been devised to carry out LD analysis, often with a view toward mapping complex disease loci. The purpose of this review is to summarize these techniques and some of the issues surrounding their application. In particular, the evolutionary factors that can confound or enhance disequilibrium analysis will be discussed, and some thoughts will be offered on the optimal choice of markers and populations for LD analysis.

Journal ArticleDOI
TL;DR: An approach for recognizing genes within orthologous genomic loci from human and mouse by first aligning the regions using an iterative global alignment system and then identifying genes based on conservation of exonic features at aligned positions in both species.
Abstract: A fundamental task in analyzing genomes is to identify the genes. This is relatively straightforward for organisms with compact genomes (such as bacteria, yeast, flies and worms) because exons tend to be large and the introns are either non-existent or tend to be short. The challenge is much greater for large genomes (such as those of mammals and higher plants), because the exonic 'signal' is scattered in a vast sea of non-genic 'noise'. While coding sequences comprise 75% of the yeast genome, they represent only about 3% of the human genome. Computational approaches have been developed for gene recognition in large genomes, with most employing various statistical tools to identify likely splice sites and to detect tell-tale differences in sequence composition between coding and non-coding DNA (Burset & Guigo 1996). Some programs perform de novo recognition, in that they directly use only information about the input sequence itself. One of the best programs of this sort is GENSCAN (Burge 1997), which uses a Hidden Markov Model to scan large genomic sequences. Other programs employ “homology” approaches, in which exons are identified by comparing a conceptual translation of DNA sequences to databases of known protein sequences (Pachter et al. 1999; Gelfand et al. 1996). In this paper, we explore a powerful new approach to gene recognition by using cross-species sequence comparison, i.e., by simultaneously analyzing homologous loci from two related species. Specifically, we focus on the ability to accurately identify coding exons by comparison of syntenic human and mouse genomic sequences. It is well known that cross-species sequence comparison can help highlight important functional elements such as exons, because such elements tend to be more strongly conserved by evolution than random genomic sequences. If a protein encoded by a gene is already known in one organism, it is relatively simple to search genomic DNA from another organism to identify genes encoding a similar protein (using computer packages such as Wise2; http://www.sanger.ac.uk/Software/Wise2). A more challenging problem is to identify exons directly from cross-species comparisons of genomic DNA. Computer programs are available that identify regions of sequence conservation, using simple “dot plots” or more sophisticated “pip plots” (Jang et al. 1999), which can then be individually analyzed in an ad hoc fashion to see whether they may contain such features as exons or regulatory elements. However, these programs simply identify conserved regions and do not systematically use the cross-species information to perform exon recognition. We sought to develop an automatic approach to exon recognition by using cross-species sequence comparison to identify and align relevant regions and then searching for the presence of exonic features at corresponding positions in both species. We began by undertaking a systematic comparison of the genomic structure of 117 orthologous gene pairs from human and mouse to understand the extent of conservation of the number, length, and sequence of exons and introns.
We then used these results to develop algorithms for cross-species gene recognition, consisting of GLASS, a new alignment program designed to provide good global alignments of large genomic regions by using a hierarchical alignment approach, and ROSETTA, a program that identifies coding exons in both species based on coincidence of genomic structure (splice sites, exon number, exon length, coding frame, and sequence similarity). ROSETTA performed extremely well in identifying coding exons, showing 95% sensitivity and 97% specificity at the nucleotide level. The performance was superior to programs that use much more sophisticated signals and statistical analysis but analyze only a single genome (Burset and Guigo 1996, Burge 1997). To our knowledge, ROSETTA is the first program for gene recognition based on cross-species comparison of genomic DNA from two organisms. The approach can be readily generalized to other pairs of organisms, as well as to the study of three or more organisms simultaneously. With the current explosion of knowledge regarding the human and mouse genomic sequences, cross-species comparison is likely to provide one of the most powerful approaches for extracting the information in mammalian genomes.
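
The coincidence idea can be caricatured in a few lines: accept a pair of aligned candidate exons only when structural features agree in both species. A toy stand-in for ROSETTA's actual tests, which also score splice-site models and sequence similarity; the exon records below are invented:

def coincident_exons(human, mouse):
    """Accept a candidate exon pair when genomic structure coincides.

    Each argument: dict with 'seq' (candidate exon sequence) plus 'donor'
    and 'acceptor' flanking dinucleotides from the genomic context.
    """
    frame_compatible = (len(human["seq"]) - len(mouse["seq"])) % 3 == 0
    canonical_splice = (human["donor"] == mouse["donor"] == "GT" and
                        human["acceptor"] == mouse["acceptor"] == "AG")
    return frame_compatible and canonical_splice

h = {"seq": "ATGGTGCACCTGACTCCT", "acceptor": "AG", "donor": "GT"}   # toy exons
m = {"seq": "ATGGTGCATCTAACCCCT", "acceptor": "AG", "donor": "GT"}
print(coincident_exons(h, m))   # True: equal length mod 3, canonical splice sites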

Journal ArticleDOI
TL;DR: The results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but GeneID is likely to be more efficient in terms of speed and memory usage.
Abstract: GeneID is a program, designed with a hierarchical structure, that predicts genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. In this paper we describe the derivation of the PWMs for sites and of the Markov model of coding DNA in Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
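
The exon-scoring term can be illustrated with a first-order Markov model; GeneID itself uses higher-order, frame-specific models estimated from training data, so the transition tables below are made up:

import math

def markov_loglik(seq, transitions):
    """Log-probability of seq under a first-order Markov model.

    transitions: (previous base, base) -> probability; the initial base is
    treated as uniform for brevity.
    """
    logp = math.log(0.25)
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transitions[(prev, cur)])
    return logp

# Invented tables: the 'coding' model mildly favors repeated bases.
coding = {(a, b): 0.4 if a == b else 0.2 for a in "ACGT" for b in "ACGT"}
noncoding = {(a, b): 0.25 for a in "ACGT" for b in "ACGT"}

candidate = "AAACCCGGGTTT"
llr = markov_loglik(candidate, coding) - markov_loglik(candidate, noncoding)
print(f"log-likelihood ratio = {llr:.2f}")   # positive favors the coding model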

Journal ArticleDOI
TL;DR: Investigation indicates that many of the incorrect gene predictions from GeneWise were due to transposons containing valid protein-coding genes, with the remaining cases being pseudogenes or possible annotation oversights.
Abstract: The GeneWise method for combining gene prediction and homology searches was applied to the 2.9-Mb region from Drosophila melanogaster. The results from the Genome Annotation Assessment Project (GASP) showed that GeneWise provided reasonably accurate gene predictions. Further investigation indicates that many of the incorrect gene predictions from GeneWise were due to transposons containing valid protein-coding genes; the remaining cases were pseudogenes or possible annotation oversights.

Journal ArticleDOI
TL;DR: The observation of long range disequilibrium between syntenic loci using low-density marker maps indicates that LD mapping has the potential to be very effective in livestock populations.
Abstract: A genome-wide linkage disequilibrium (LD) map was generated using microsatellite genotypes (284 autosomal microsatellite loci) of 581 gametes sampled from the Dutch black-and-white dairy cattle population. LD was measured between all marker pairs, both syntenic and nonsyntenic. Analysis of syntenic pairs revealed surprisingly high levels of LD that, although more pronounced for closely linked marker pairs, extended over several tens of centimorgans. In addition, significant gametic associations were also shown to be very common between nonsyntenic loci. Simulations using the known genealogies of the studied sample indicate that random drift alone is likely to account for most of the observed disequilibrium. No clear evidence was obtained for a direct effect of selection ("Bulmer effect"). The observation of long-range disequilibrium between syntenic loci using low-density marker maps indicates that LD mapping has the potential to be very effective in livestock populations. The frequent occurrence of gametic associations between nonsyntenic loci, however, encourages the combined use of linkage and linkage disequilibrium methods to avoid false positive results when mapping genes in livestock.
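
Pairwise LD is conventionally summarized by D, D', and r² computed from haplotype and allele frequencies. A sketch for the biallelic case (the study's microsatellites are multiallelic, so its statistics aggregate over alleles), with invented frequencies:

def ld_measures(p_ab, p_a, p_b):
    """D, D', and r^2 for two biallelic loci.

    p_ab: frequency of the A-B haplotype; p_a, p_b: allele frequencies.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

d, d_prime, r2 = ld_measures(p_ab=0.40, p_a=0.50, p_b=0.60)   # invented gamete data
print(f"D = {d:.2f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")      # 0.10, 0.50, 0.17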

Journal ArticleDOI
TL;DR: A statistical procedure for predicting whether genes of a complete genome have been acquired by horizontal gene transfer, based on the analysis of G+C contents, codon usage, amino acid usage, and gene position, found that informational genes were less likely to be transferred than operational genes.
Abstract: There is growing evidence that horizontal gene transfer is a potent evolutionary force in prokaryotes, although exactly how potent is not known. We have developed a statistical procedure for predicting whether genes of a complete genome have been acquired by horizontal gene transfer. It is based on the analysis of G+C contents, codon usage, amino acid usage, and gene position. When we applied this procedure to 17 complete bacterial genomes and seven archaeal ones, we found that the percentage of horizontally transferred genes varied from 1.5% to 14.5%. Archaea and nonpathogenic bacteria had the highest percentages, and pathogenic bacteria, except for Mycoplasma genitalium, had the lowest. As reported in the literature, we found that informational genes were less likely to be transferred than operational genes. Most of the horizontally transferred genes were present in only one or two lineages. These transferred genes include genes that form part of prophages, pathogenicity islands, transposases, integrases, recombinases, genes present in only one of the two Helicobacter pylori strains, and regions of functionally related genes. All of these findings support the important role of horizontal gene transfer in the molecular evolution of microorganisms and speciation.
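
The simplest of the four signals is compositional: flag genes whose G+C content is atypical for the genome. A sketch of that one signal only, with invented values and cutoff; the published procedure combines composition with codon usage, amino acid usage, and gene position:

from statistics import mean, stdev

def gc_content(seq):
    """Fraction of G+C bases in a gene sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def flag_atypical(gc_by_gene, z_cutoff=2.0):
    """Names of genes whose GC content deviates > z_cutoff standard deviations."""
    mu, sigma = mean(gc_by_gene.values()), stdev(gc_by_gene.values())
    return [g for g, v in gc_by_gene.items() if abs(v - mu) / sigma > z_cutoff]

gc = {f"core{i}": 0.50 for i in range(9)}            # hypothetical native genes
gc["island1"] = gc_content("ATTATAATGCATTTAAATTA")   # hypothetical acquired gene, GC = 0.10
print(flag_atypical(gc))                             # ['island1']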

Journal ArticleDOI
TL;DR: The new features in FPC are described, the scenario for building the maps of chromosomes 9, 10 and 13, and the results from the simulation are described.
Abstract: Contigs have been assembled, and over 2800 clones selected for sequencing, for human chromosomes 9, 10 and 13. Using the FPC (FingerPrinted Contig) software, the contigs are assembled with markers and complete-digest fingerprints, and the contigs are ordered and localised by a global framework. Publicly available resources have been used, such as the 1998 International Gene Map for the framework and the GSC Human BAC fingerprint database for the majority of the fingerprints. Additional markers and fingerprints are generated in-house to supplement these data. To support the scale-up of map building, FPC V4.7 has been extended so that markers can be used with the fingerprints for assembly of contigs, new clones and markers can be automatically added to existing contigs, and poorly assembled contigs are marked accordingly. To test the automatic assembly, a simulated complete digest of 110 Mb of concatenated human sequence was used to create datasets with varying coverage, length of clones, and types of error. When no error was introduced and a tolerance of 7 was used in assembly, the largest contig with no false-positive overlaps had 9534 clones with 37 out-of-order clones; that is, the starting coordinates of adjacent clones are in the wrong order. This paper describes the new features in FPC, the scenario for building the maps of chromosomes 9, 10 and 13, and the results from the simulation.
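
At the heart of fingerprint assembly is deciding whether two clones share enough restriction bands, with band sizes compared within a tolerance (7, in the simulation above). A greatly simplified sketch; FPC converts shared-band counts into a Sulston coincidence probability rather than using a raw count:

def shared_bands(bands_a, bands_b, tolerance=7):
    """Count bands matching 1-to-1 within +/- tolerance (greedy, sorted inputs)."""
    shared, j = 0, 0
    for a in bands_a:
        while j < len(bands_b) and bands_b[j] < a - tolerance:
            j += 1                      # skip bands too small to match a
        if j < len(bands_b) and abs(bands_b[j] - a) <= tolerance:
            shared += 1                 # consume the matched band
            j += 1
    return shared

clone1 = [120, 340, 560, 780, 1020, 1450]   # hypothetical band sizes
clone2 = [118, 345, 700, 782, 1023, 1800]
print(shared_bands(clone1, clone2))         # 4 shared bands -> candidate overlap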

Journal ArticleDOI
TL;DR: Human and mouse genomic sequence comparisons are being increasingly used to search for evolutionarily conserved gene regulatory elements and a question remains as to whether most of these noncoding sequences are conserved because of functional constraints or are the result of a lack of divergence time.
Abstract: Human and mouse genomic sequence comparisons are being increasingly used to search for evolutionarily conserved gene regulatory elements. Large-scale human-mouse DNA comparison studies have discovered numerous conserved noncoding sequences, of which only a fraction has been functionally investigated. A question therefore remains as to whether most of these noncoding sequences are conserved because of functional constraints or are the result of a lack of divergence time.

Journal ArticleDOI
TL;DR: A strategy to prepare normalized and subtracted cDNA libraries in a single step based on hybridization of the first-strand, full-length cDNA with several RNA drivers, including starting mRNA as the normalizing driver and run-off transcripts from minilibraries containing highly expressed genes, rearrayed clones, and previously sequenced cDNAs as subtracting drivers.
Abstract: In the effort to prepare the mouse full-length cDNA encyclopedia, we previously developed several techniques to prepare and select full-length cDNAs. To increase the number of different cDNAs, we introduce here a strategy to prepare normalized and subtracted cDNA libraries in a single step. The method is based on hybridization of the first-strand, full-length cDNA with several RNA drivers, including starting mRNA as the normalizing driver and run-off transcripts from minilibraries containing highly expressed genes, rearrayed clones, and previously sequenced cDNAs as subtracting drivers. Our method keeps the proportion of full-length cDNAs in the subtracted/normalized library high. Moreover, our method dramatically enhances the discovery of new genes as compared to results obtained by using standard, full-length cDNA libraries. This procedure can be extended to the preparation of full-length cDNA encyclopedias from other organisms.

Journal ArticleDOI
TL;DR: The low rate of chimerism, approximately 1%, and the low level of detected rearrangements support the anticipated usefulness of the BAC libraries for genome research.
Abstract: Bacterial artificial chromosome (BAC) and P1-derived artificial chromosome (PAC) libraries providing a combined 33-fold representation of the murine genome have been constructed using two different restriction enzymes for genomic digestion. A large-insert PAC library was prepared from the 129S6/SvEvTac strain in a bacterial/mammalian shuttle vector to facilitate functional gene studies. For genome mapping and sequencing, we prepared BAC libraries from the 129S6/SvEvTac and the C57BL/6J strains. The average insert sizes for the three libraries range between 130 kb and 200 kb. Based on the numbers of clones and the observed average insert sizes, we estimate each library to have slightly in excess of 10-fold genome representation. The average number of clones found after hybridization screening with 28 probes was in the range of 9-14 clones per marker. To explore the fidelity of the genomic representation in the three libraries, we analyzed three contigs, each established after screening with a single unique marker. New markers were established from the end sequences and screened against all the contig members to determine if any of the BACs and PACs are chimeric or rearranged. Only one chimeric clone and six potential deletions have been observed after extensive analysis of 113 PAC and BAC clones. Seventy-one of the 113 clones were conclusively nonchimeric because both end markers or sequences were mapped to the other confirmed contig members. We could not exclude chimerism for the remaining 41 clones because one or both of the insert termini did not contain unique sequence to design markers. The low rate of chimerism, approximately 1%, and the low level of detected rearrangements support the anticipated usefulness of the BAC libraries for genome research.
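
The representation figures follow from simple coverage arithmetic: fold-coverage = clones × mean insert size / genome size, and the Clarke-Carbon formula gives the chance a locus is represented. A worked check with round numbers (mouse genome taken as roughly 3 Gb; the clone count is invented, not the libraries' actual totals):

import math

def fold_coverage(n_clones, insert_bp, genome_bp=3.0e9):
    """Fold-representation of a clone library over a ~3-Gb genome."""
    return n_clones * insert_bp / genome_bp

def p_locus_present(fold):
    """Clarke-Carbon: probability that a given locus is in at least one clone."""
    return 1.0 - math.exp(-fold)

fold = fold_coverage(n_clones=200_000, insert_bp=150_000)
print(f"{fold:.0f}-fold coverage, P(locus present) = {p_locus_present(fold):.3%}")
# 10-fold coverage, P(locus present) = 99.995%

At 10-fold coverage the expected number of clones per hybridization probe is about 10, in line with the 9-14 observed above.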

Journal ArticleDOI
TL;DR: Evidence is presented that in three other possible transitions of LTR retrotransposons to retroviruses, an envelope-like gene was acquired from a viral source, representing the only cases in which the env gene of a retrovirus has been traced back to its original source.
Abstract: Phylogenetic analyses suggest that long-terminal repeat (LTR) bearing retrotransposable elements can acquire additional open reading frames that enable them to mediate infection. Whereas this process is best documented in the origin of the vertebrate retroviruses and their acquisition of an envelope (env) gene, similar independent events may have occurred in insects, nematodes, and plants. The origins of env-like genes are unclear, and are often masked by the antiquity of the original acquisitions and by their rapid rate of evolution. In this report, we present evidence that in three other possible transitions of LTR retrotransposons to retroviruses, an envelope-like gene was acquired from a viral source. First, the gypsy and related LTR retrotransposable elements (the insect errantiviruses) acquired their envelope-like gene from a class of insect baculoviruses (double-stranded DNA viruses with no RNA stage). Second, the Cer retroviruses in the Caenorhabditis elegans genome acquired their envelope gene from a phleboviral (single-stranded, ambisense RNA virus) source. Third, the envelope of the Tas retrovirus (from Ascaris lumbricoides) may have been obtained from the Herpesviridae (double-stranded DNA viruses, no RNA stage). These represent the only cases in which the env gene of a retrovirus has been traced back to its original source. This has implications for the evolutionary history of retroviruses as well as for the potential ability of all LTR retrotransposable elements to become infectious agents.

Journal ArticleDOI
TL;DR: The data suggest that intraelement recombination events deleted most of the original retrotransposon sequences, thereby providing a possible mechanism to counteract retroelement-driven genome expansion.
Abstract: Organisms with large genomes contain vast amounts of repetitive DNA sequences, much of which is composed of retrotransposons. Amplification of retrotransposons has been postulated to be a major mechanism increasing genome size and leading to "genomic obesity." To gain insights into the relation between retrotransposons and genome expansion in a large genome, we have studied a 66-kb contiguous sequence at the Rar1 locus of barley in detail. Three genes were identified in the 66-kb contig, clustered within an interval of 18 kb. Inspection of sequences flanking the gene space unveiled four novel retroelements, designated Nikita, Sukkula, Sabrina, and BAGY-2 and several units of the known BARE-1 element. The retroelements identified are responsible for at least 15 integration events, predominantly arranged as multiple nested insertions. Strikingly, most of the retroelements exist as solo LTRs (Long Terminal Repeats), indicating that unequal crossing over and/or intrachromosomal recombination between LTRs is a common feature in barley. Our data suggest that intraelement recombination events deleted most of the original retrotransposon sequences, thereby providing a possible mechanism to counteract retroelement-driven genome expansion.