Showing papers in "Genome Research in 1999"


Journal ArticleDOI
TL;DR: The third generation of the CAP sequence assembly program is described, which has a capability to clip 5' and 3' low-quality regions of reads and uses forward-reverse constraints to correct assembly errors and link contigs.
Abstract: The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed (Staden 1980; Peltola et al. 1984; Huang 1992; Smith et al. 1993; Gleizes and Henaut 1994; Lawrence et al. 1994; Kececioglu and Myers 1995; Sutton et al. 1995; Green 1996). The FAKII program provides a library of routines for each phase of the assembly process (Larson et al. 1996). The GAP4 program has a number of useful interactive features (Bonfield et al. 1995). The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences (Green 1996). TIGR Assembler has been used in a number of megabase microbial genome projects (Sutton et al. 1995). Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects. We have developed the third generation of the CAP sequence assembly program (Huang 1992). The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED (Ewing et al. 1998) are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. 
It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward–reverse constraints. An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing of both ends of a subclone. A forward–reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with use of forward–reverse constraints in assembly is that some of the forward–reverse constraints are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly. Thus, a few unsatisfied constraints in a contig may not be sufficient to indicate an assembly error in the contig. However, if a sufficient number of constraints are all inconsistent with a join in a contig and all support an alternative join, it is likely that the current join is an error, and the alternative join should be made.
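
The constraint logic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not CAP3's implementation: the `Placement` record, the way distance is measured (outer span of the two reads), and the helper names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Hypothetical record of where a read sits in a contig."""
    start: int      # leftmost contig coordinate covered by the read
    end: int        # rightmost contig coordinate covered by the read
    forward: bool   # True if the read lies on the forward strand


def satisfied(a: Placement, b: Placement, min_dist: int, max_dist: int) -> bool:
    """A forward-reverse constraint holds when the two reads are on opposite
    strands and their distance falls within the specified range.
    Assumption: distance is the outer span covered by both reads."""
    if a.forward == b.forward:
        return False
    span = max(a.end, b.end) - min(a.start, b.start)
    return min_dist <= span <= max_dist


def count_unsatisfied(constraints, placements, min_dist, max_dist) -> int:
    """Count read pairs whose constraint is violated by the current layout."""
    return sum(
        1
        for r1, r2 in constraints
        if not satisfied(placements[r1], placements[r2], min_dist, max_dist)
    )


placements = {
    "f1": Placement(0, 500, True),
    "r1": Placement(1500, 2000, False),
    "f2": Placement(100, 600, True),
    "r2": Placement(300, 800, True),   # same strand as f2: violates its constraint
}
pairs = [("f1", "r1"), ("f2", "r2")]
violations = count_unsatisfied(pairs, placements, min_dist=1000, max_dist=3000)
```

A single violation, as here, would not by itself indicate an assembly error under the strategy in the abstract; only a sufficient number of constraints that all contradict one join and all support the same alternative would.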

5,074 citations


Journal ArticleDOI
TL;DR: Whole-genome analysis indicates that this class of proteins is ancient and has undergone considerable functional divergence prior to the emergence of the major divisions of life.
Abstract: Using a combination of computer methods for iterative database searches and multiple sequence alignment, we show that protein sequences related to the AAA family of ATPases are far more prevalent than reported previously. Among these are regulatory components of Lon and Clp proteases, proteins involved in DNA replication, recombination, and restriction (including subunits of the origin recognition complex, replication factor C proteins, MCM DNA-licensing factors and the bacterial DnaA, RuvB, and McrB proteins), prokaryotic NtrC-related transcription regulators, the Bacillus sporulation protein SpoVJ, Mg2+ and Co2+ chelatases, the Halobacterium GvpN gas vesicle synthesis protein, dynein motor proteins, TorsinA, and Rubisco activase. Alignment of these sequences, in light of the structures of the clamp loader delta' subunit of Escherichia coli DNA polymerase III and the hexamerization component of N-ethylmaleimide-sensitive fusion protein, provides structural and mechanistic insights into these proteins, collectively designated the AAA+ class. Whole-genome analysis indicates that this class is ancient and has undergone considerable functional divergence prior to the emergence of the major divisions of life. These proteins often perform chaperone-like functions that assist in the assembly, operation, or disassembly of protein complexes. The hexameric architecture often associated with this class can provide a hole through which DNA or RNA can be threaded; this may be important for assembly or remodeling of DNA-protein complexes.

1,830 citations


Journal ArticleDOI
TL;DR: A similarity measure that reduces the number of false positives, a new clustering algorithm designed specifically for grouping gene expression patterns, and an interactive graphical cluster analysis tool that allows user feedback and validation are described.
Abstract: Analysis procedures are needed to extract useful information from the large amount of gene expression data that is becoming available. This work describes a set of analytical tools and their application to yeast cell cycle data. The components of our approach are (1) a similarity measure that reduces the number of false positives, (2) a new clustering algorithm designed specifically for grouping gene expression patterns, and (3) an interactive graphical cluster analysis tool that allows user feedback and validation. We use the clusters generated by our algorithm to summarize genome-wide expression and to initiate supervised clustering of genes into biologically meaningful groups.

1,228 citations


Journal ArticleDOI
TL;DR: The dbSNP database is designed to serve as a general catalog of molecular variation to supplement GenBank and can include a broad range of molecular polymorphisms: single base nucleotide substitutions, short deletion and insertion polymorphisms, microsatellite markers, and polymorphic insertion elements such as retrotransposons.
Abstract: A key aspect of research in genetics is associating sequence variations with heritable phenotypes. The most common variations are single nucleotide polymorphisms (SNPs), which occur approximately once every 500–1000 bases in a large sample of aligned human sequence. Because SNPs are expected to facilitate large-scale association genetics studies, there has recently been great interest in SNP discovery and detection. In collaboration with the National Human Genome Research Institute (NHGRI), the National Center for Biotechnology Information (NCBI) has established the dbSNP database (http://www.ncbi.nlm.nih.gov/SNP) to serve as a central repository for molecular variation. Designed to serve as a general catalog of molecular variation to supplement GenBank (Benson et al. 1999), database submissions can include a broad range of molecular polymorphisms: single base nucleotide substitutions, short deletion and insertion polymorphisms, microsatellite markers, and polymorphic insertion elements such as retrotransposons. Although the name dbSNP is a slight misnomer given the variations represented, SNP polymorphisms are the largest class of variation in the database, and the name dbSNP, selected at the request of NHGRI, reflects this fact. For the sake of brevity, we elected to use the term SNP as a shorthand for “variation” in the database notation and documentation (http://www.ncbi.nlm.nih.gov/SNP/get_html.cgi?whichHtml=how_to_submit). Thus terms used in the documentation like “submitted SNP” or “reference SNP” refer to all classes of variation in the database and should be regarded as meaning “a submitted report of variation” and “a reference report of variation.” Furthermore, it should be noted that in serving its role as the variation complement to GenBank, dbSNP does not restrict submissions to only neutral polymorphisms. Submissions are welcome on all classes of simple molecular variation, including those that cause rare clinical phenotypes.
Submissions to dbSNP come from a variety of sources including individual laboratories, collaborative polymorphism discovery efforts, large-scale genome sequencing centers, and private industry. The data collected range from the tightly focused characterization of particular genes to broadly sampled levels of variation from random genomic sequence. The distribution of reported marker density across the genome is thus expected to be mixed, with an expected minimum density of 1/3000 bases in regions of random genomic sequence, and local regions of higher density around well-characterized genes. Each variation submitted to dbSNP must have an identifier provided by the submitter (called a “local” identifier by dbSNP), and each is issued a unique identifier, formatted as an integer prefixed with ss (for submitted SNP), for example, ss334. An ss number is thus permanently associated with the submitter’s identifier, and it can be treated as a formal accession number by the scientific publishing community.
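
The identifier format described above (an integer prefixed with ss, for example ss334) is simple enough to check mechanically. The following is a minimal sketch; the function name `parse_ss` and the strict ASCII-digit rule are assumptions made for illustration, not part of any dbSNP tooling.

```python
import re

# Assumed format, taken from the abstract: "ss" followed by an integer.
SS_PATTERN = re.compile(r"ss([0-9]+)")


def parse_ss(accession: str) -> int:
    """Return the integer part of a submitted-SNP accession such as 'ss334'.
    Raises ValueError for anything that does not match the assumed format."""
    match = SS_PATTERN.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a submitted-SNP accession: {accession!r}")
    return int(match.group(1))


number = parse_ss("ss334")
```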

615 citations


Journal ArticleDOI
TL;DR: First inventory of exon-intron structures of known human genes using EST contigs from the TIGR Human Gene Index shows evidence of alternative splicing in 35% of genes and the majority of splicing events occurred in 5' untranslated regions, suggesting wide occurrence of alternative regulation.
Abstract: Alternative splicing can produce variant proteins and expression patterns as different as the products of different genes, yet the prevalence of alternative splicing has not been quantified. Here the spliced alignment algorithm was used to make a first inventory of exon-intron structures of known human genes using EST contigs from the TIGR Human Gene Index. The results on any one gene may be incomplete and will require verification, yet the overall trends are significant. Evidence of alternative splicing was shown in 35% of genes and the majority of splicing events occurred in 5' untranslated regions, suggesting wide occurrence of alternative regulation. Most of the alternative splices of coding regions generated additional protein domains rather than alternating domains.

570 citations


Journal ArticleDOI
TL;DR: Estimates of 4Nc for a number of gene regions and human populations will be of use in determining the density of SNPs that are likely to be required for successful association studies.
Abstract: The statistical power of five association study test statistics (two haplotype-based tests, two marker-based tests, and the Transmission Disequilibrium Test-Q5) to detect single nucleotide polymorphism (SNP)/phenotype associations in a linkage-disequilibrium-based candidate gene scan employing a number of SNPs is examined. Power is estimated as a function of realistic parameters expected to affect the likelihood of detecting a significant association: the number of SNPs examined, the scaled recombination size of the region examined, the proportion of variance in the trait attributable to a hidden causative polymorphism within the region, and the number of individuals or families examined. For the different combinations of parameter values, power is estimated from a large number of realizations of a simulated coalescent describing a single random mating population with mutation, random genetic drift, and recombination. This explicit population genetics model results in a distribution of DNA marker heterozygosities and linkage disequilibria that are likely to resemble those expected in actual population samples. The study concludes that (1) marker-based permutation tests are more powerful than simple haplotype-based tests, (2) there is sufficient power to detect the presence of causative polymorphisms of small effect if on the order of 500 individuals are sampled, (3) greater power is achieved by increasing the sample size than by increasing the number of polymorphisms, (4) association studies are generally more powerful than transmission disequilibrium-based tests, and (5) for the range of parameters considered association studies have a low repeatability unless sample sizes are on the order of 500 individuals. Estimates of 4Nc for a number of gene regions and human populations will be of use in determining the density of SNPs that are likely to be required for successful association studies.

455 citations


Journal ArticleDOI
TL;DR: This homogeneous DNA diagnostic method is shown to be highly sensitive and specific and is suitable for automated genotyping of large numbers of samples.
Abstract: A new method for DNA diagnostics based on template-directed primer extension and detection by fluorescence polarization is described. In this method, amplified genomic DNA containing a polymorphic locus is incubated with oligonucleotide primers (designed to hybridize to the DNA template adjacent to the polymorphic site) in the presence of allele-specific dye-labeled dideoxyribonucleoside triphosphates and a commercially available modified Taq DNA polymerase. The primer is extended by the dye-terminator specific for the allele present on the template, increasing approximately 10-fold the molecular weight of the fluorophore. At the end of the reaction, the fluorescence polarization of the two dye-terminators in the reaction mixture is analyzed directly without separation or purification. This homogeneous DNA diagnostic method is shown to be highly sensitive and specific and is suitable for automated genotyping of large numbers of samples. [The data shown in Figure 3 are available as an online supplement at.]

423 citations


Journal ArticleDOI
TL;DR: No consistent large-scale bacterial phylogeny could be established, and argument is presented that, although lineage-specific gene loss might have contributed to the evolution of some of the aaRSs, this is not a viable alternative to horizontal gene transfer as the principal evolutionary phenomenon in this gene class.
Abstract: Phylogenetic analysis of aminoacyl-tRNA synthetases (aaRSs) of all 20 specificities from completely sequenced bacterial, archaeal, and eukaryotic genomes reveals a complex evolutionary picture. Detailed examination of the domain architecture of aaRSs using sequence profile searches delineated a network of partially conserved domains that is even more elaborate than previously suspected. Several unexpected evolutionary connections were identified, including the apparent origin of the beta-subunit of bacterial GlyRS from the HD superfamily of hydrolases, a domain shared by bacterial AspRS and the B subunit of archaeal glutamyl-tRNA amidotransferases, and another previously undetected domain that is conserved in a subset of ThrRS, guanosine polyphosphate hydrolases and synthetases, and a family of GTPases. Comparison of domain architectures and multiple alignments resulted in the delineation of synapomorphies (shared derived characters, such as extra domains or inserts) for most of the aaRSs specificities. These synapomorphies partition sets of aaRSs with the same specificity into two or more distinct and apparently monophyletic groups. In conjunction with cluster analysis and a modification of the midpoint-rooting procedure, this partitioning was used to infer the likely root position in phylogenetic trees. The topologies of the resulting rooted trees for most of the aaRSs specificities are compatible with the evolutionary "standard model" whereby the earliest radiation event separated bacteria from the common ancestor of archaea and eukaryotes as opposed to the two other possible evolutionary scenarios for the three major divisions of life. For almost all aaRSs specificities, however, this simple scheme is confounded by displacement of some of the bacterial aaRSs by their eukaryotic or, less frequently, archaeal counterparts. Displacement of ancestral eukaryotic aaRS genes by bacterial ones, presumably of mitochondrial origin, was observed for three aaRSs.
In contrast, there was no convincing evidence of displacement of archaeal aaRSs by bacterial ones. Displacement of aaRS genes by eukaryotic counterparts is most common among parasitic and symbiotic bacteria, particularly the spirochaetes, in which 10 of the 19 aaRSs seem to have been displaced by the respective eukaryotic genes and two by the archaeal counterpart. Unlike the primary radiation events between the three main divisions of life, which were readily traceable through the phylogenetic analysis of aaRSs, no consistent large-scale bacterial phylogeny could be established. In part, this may be due to additional gene displacement events among bacterial lineages. Argument is presented that, although lineage-specific gene loss might have contributed to the evolution of some of the aaRSs, this is not a viable alternative to horizontal gene transfer as the principal evolutionary phenomenon in this gene class.

420 citations


Journal ArticleDOI
TL;DR: This work has assembled 300,000 distinct sequences and identified 850 mismatches from contiguous EST data sets (candidate SNP sites) and confirmed the presence of a subset of these candidate SNP sites and estimated the allele frequencies in three human populations with different ethnic origins.
Abstract: There is considerable interest in the discovery and characterization of single nucleotide polymorphisms (SNPs) to enable the analysis of the potential relationships between human genotype and phenotype. Here we present a strategy that permits the rapid discovery of SNPs from publicly available expressed sequence tag (EST) databases. From a set of ESTs derived from 19 different cDNA libraries, we assembled 300,000 distinct sequences and identified 850 mismatches from contiguous EST data sets (candidate SNP sites), without de novo sequencing. Through a polymerase-mediated, single-base, primer extension technique, Genetic Bit Analysis (GBA), we confirmed the presence of a subset of these candidate SNP sites and have estimated the allele frequencies in three human populations with different ethnic origins. Altogether, our approach provides a basis for rapid and efficient regional and genome-wide SNP discovery using data assembled from sequences from different libraries of cDNAs. [The SNPs identified in this study can be found in the National Center for Biotechnology Information (NCBI) SNP database under submitter handles ORCHID (SNPS-981210-A) and debnick (SNPS-981209-A and SNPS-981209-B).]

351 citations


Journal ArticleDOI
TL;DR: It is shown that computer analyses of plant EST data can be used to generate evidence of correlated expression patterns of genes across various tissues, and tissue types and organs can be classified with respect to one another on the basis of their global gene expression patterns.
Abstract: Large, publicly available collections of expressed sequence tags (ESTs) have been generated from Arabidopsis thaliana and rice (Oryza sativa). A potential, but relatively unexplored application of this data is in the study of plant gene expression. Other EST data, mainly from human and mouse, have been successfully used to point out genes exhibiting tissue- or disease-specific expression, as well as for identification of alternative transcripts. In this report, we go a step further in showing that computer analyses of plant EST data can be used to generate evidence of correlated expression patterns of genes across various tissues. Furthermore, tissue types and organs can be classified with respect to one another on the basis of their global gene expression patterns. As in previous studies, expression profiles are first estimated from EST counts. By clustering gene expression profiles or whole cDNA library profiles, we show that genes with similar functions, or cDNA libraries expected to share patterns of gene expression, are grouped together. Promising uses of this technique include functional genomics, in which evidence of correlated expression might complement (or substitute for) those of sequence similarity in the annotation of anonymous genes and identification of surrogate markers. The analysis presented here combines the application of a correlation-based clustering method with a graphical color map allowing intuitive visualization of patterns within a large table of expression measurements.
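
The idea of estimating expression profiles from EST counts and comparing them by correlation can be illustrated as follows. This is a sketch under assumptions: normalizing each count by the size of its cDNA library and using Pearson correlation are choices made here for illustration, and the counts are invented; the abstract does not specify the exact estimator.

```python
from math import sqrt


def profile(counts, library_sizes):
    """Estimate an expression profile as the fraction of ESTs per library.
    Assumption: simple normalization by library size."""
    return [c / n for c, n in zip(counts, library_sizes)]


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


# Made-up EST counts for three hypothetical genes across three libraries.
library_sizes = [1000, 2000, 4000]
gene_a = profile([10, 40, 20], library_sizes)
gene_b = profile([20, 80, 40], library_sizes)   # same pattern as gene_a, doubled
gene_c = profile([20, 20, 80], library_sizes)   # different pattern

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Genes with high pairwise correlation, such as `gene_a` and `gene_b` here, would be candidates for the same cluster in a correlation-based method.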

253 citations


Journal ArticleDOI
TL;DR: A comparison of the protein sequences encoded in the four euryarchaeal species whose genomes have been sequenced completely revealed 1326 orthologous sets, of which 543 are represented in all four species.
Abstract: Comparative analysis of the protein sequences encoded in the four euryarchaeal species whose genomes have been sequenced completely (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, and Pyrococcus horikoshii) revealed 1326 orthologous sets, of which 543 are represented in all four species. The proteins that belong to these conserved euryarchaeal families comprise 31%-35% of the gene complement and may be considered the evolutionarily stable core of the archaeal genomes. The core gene set includes the great majority of genes coding for proteins involved in genome replication and expression, but only a relatively small subset of metabolic functions. For many gene families that are conserved in all euryarchaea, previously undetected orthologs in bacteria and eukaryotes were identified. A number of euryarchaeal synapomorphies (unique shared characters) were identified; these are protein families that possess sequence signatures or domain architectures that are conserved in all euryarchaea but are not found in bacteria or eukaryotes. In addition, euryarchaea-specific expansions of several protein and domain families were detected. In terms of their apparent phylogenetic affinities, the archaeal protein families split into bacterial and eukaryotic families. The majority of the proteins that have only eukaryotic orthologs or show the greatest similarity to their eukaryotic counterparts belong to the core set. The families of euryarchaeal genes that are conserved in only two or three species constitute a relatively mobile component of the genomes whose evolution should have involved multiple events of lineage-specific gene loss and horizontal gene transfer. Frequently these proteins have detectable orthologs only in bacteria or show the greatest similarity to the bacterial homologs, which might suggest a significant role of horizontal gene transfer from bacteria in the evolution of the euryarchaeota.

Journal ArticleDOI
TL;DR: A pairwise similarity measure between two p-dimensional data points, x and y, is introduced that is superior to commonly used metric distances, for example, Euclidean distance and a modified version of mutual information is introduced as a novel method for validating clustering results when the true clustering is known.
Abstract: Clustering is one of the main mathematical challenges in large-scale gene expression analysis. We describe a clustering procedure based on a sequential k-means algorithm with additional refinements that is able to handle high-throughput data in the order of hundreds of thousands of data items measured on hundreds of variables. The practical motivation for our algorithm is oligonucleotide fingerprinting—a method for simultaneous determination of expression level for every active gene of a specific tissue—although the algorithm can be applied as well to other large-scale projects like EST clustering and qualitative clustering of DNA-chip data. As a pairwise similarity measure between two p-dimensional data points, x and y, we introduce mutual information that can be interpreted as the amount of information about x in y, and vice versa. We show that for our purposes this measure is superior to commonly used metric distances, for example, Euclidean distance. We also introduce a modified version of mutual information as a novel method for validating clustering results when the true clustering is known. The performance of our algorithm with respect to experimental noise is shown by extensive simulation studies. The algorithm is tested on a subset of 2029 cDNA clones coming from 15 different genes from a cDNA library derived from human dendritic cells. Furthermore, the clustering of these 2029 cDNA clones is demonstrated when the entire set of 76,032 cDNA clones is processed.
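
Mutual information as a pairwise similarity between two p-dimensional points can be sketched as below. The sketch assumes the coordinates have already been discretized into a small set of symbols and uses the plain empirical estimate in bits; the paper's exact discretization and normalization are not given in the abstract.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Empirical mutual information (in bits) between two equal-length
    sequences of discrete values, treating each coordinate as one sample."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    p = len(x)
    joint = Counter(zip(x, y))
    count_x = Counter(x)
    count_y = Counter(y)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / p) * log2(n_ab * p / (count_x[a] * count_y[b]))
    return mi


identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Unlike Euclidean distance, this measure is unchanged if the symbols of one vector are relabeled, which is one reason it can behave differently from metric distances.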

Journal ArticleDOI
TL;DR: The genomic trees presented here place the Archaea in the proximity of the Bacteria when the whole gene content of each organism is considered, and when ancestral gene duplications are eliminated.
Abstract: The availability of a number of complete cellular genome sequences allows the development of organisms’ classification, taking into account their genome content, the loss or acquisition of genes, and overall gene similarities as signatures of common ancestry. On the basis of correspondence analysis and hierarchical classification methods, a methodological framework is introduced here for the classification of the available 20 completely sequenced genomes and partial information for Schizosaccharomyces pombe, Homo sapiens, and Mus musculus. The outcome of such an analysis leads to a classification of genomes that we call a genomic tree. Although these trees are phenograms, they carry with them strong phylogenetic signatures and are remarkably similar to 16S-like rRNA-based phylogenies. Our results suggest that duplication and deletion events that took place through evolutionary time were globally similar in related organisms. The genomic trees presented here place the Archaea in the proximity of the Bacteria when the whole gene content of each organism is considered, and when ancestral gene duplications are eliminated. Genomic trees represent an additional approach for the understanding of evolution at the genomic level and may contribute to the proper assessment of the evolutionary relationships between extant species.

Journal ArticleDOI
TL;DR: A prototype assay for genotyping a panel of 35 biallelic sites that represent variation within 15 genes from biochemical pathways implicated in the development and progression of cardiovascular disease, which provides a research tool for studies of multilocus genetic risk factors in large cardiovascular disease cohorts, and for the subsequent development of diagnostic tests.
Abstract: A number of chronic diseases, including cardiovascular disease, appear to have a multifactorial genetic risk component. Consequently, techniques are needed to facilitate evaluation of complex genetic risk factors in large cohorts. We have designed a prototype assay for genotyping a panel of 35 biallelic sites that represent variation within 15 genes from biochemical pathways implicated in the development and progression of cardiovascular disease. Each DNA sample is amplified using two multiplex polymerase chain reactions, and the alleles are genotyped simultaneously using an array of immobilized, sequence-specific oligonucleotide probes. This multilocus assay was applied to two types of cohorts. Population frequencies for the markers were estimated using 496 unrelated individuals from a family-based cohort, and the observed values were consistent with previous reports. Linkage disequilibrium between consecutive pairs of markers within the apoCIII, LPL, and ELAM genes was also estimated. A preliminary analysis of single and pairwise locus associations with severity of atherosclerosis was performed using a composite cohort of 142 individuals for whom quantitative angiography data were available; evaluation of the potentially interesting associations observed will require analysis of an independent and larger cohort. This assay format provides a research tool for studies of multilocus genetic risk factors in large cardiovascular disease cohorts, and for the subsequent development of diagnostic tests.

Journal ArticleDOI
TL;DR: The development of a self-contained (homogeneous), single-tube assay for the genotyping of single-nucleotide polymorphisms (SNPs), which does not rely on fluorescent oligonucleotide probes, and is recommended both as a simple and inexpensive diagnostic tool for genotyping medically relevant SNPs and as a high-throughput SNP genotyping method for gene mapping.
Abstract: We report the development of a self-contained (homogeneous), single-tube assay for the genotyping of single-nucleotide polymorphisms (SNPs), which does not rely on fluorescent oligonucleotide probes. The method, which we call Tm-shift genotyping, combines allele-specific PCR with the discrimination between amplification products by their melting temperatures (Tm). Two distinct forward primers, each of which contains a 3′-terminal base that corresponds to one of the two SNP allelic variants, are combined with a common reverse primer in a single-tube reaction. A GC-tail is attached to one of the forward allele-specific primers to increase the Tm of the amplification product from the corresponding allele. PCR amplification, Tm analysis, and allele determination of genomic template DNA are carried out on a fluorescence-detecting thermocycler with a dye that fluoresces when bound to dsDNA. We demonstrate the accuracy and reliability of Tm-shift genotyping on 100 samples typed for two SNPs, and recommend it both as a simple and inexpensive diagnostic tool for genotyping medically relevant SNPs and as a high-throughput SNP genotyping method for gene mapping.

Journal ArticleDOI
TL;DR: There is increasing evidence for the primacy of selection in molding genome sizes via impacts on cell size and division rates and processes inducing quantum or doubling series variation in gametic or somatic genome sizes are common.
Abstract: The forces responsible for modulating the large-scale features of the genome remain one of the most difficult issues confronting evolutionary biology. Although diversity in chromosomal architecture, nucleotide composition, and genome size has been well documented, there is little understanding of either the evolutionary origins or impact of much of this variation. The 80,000-fold divergence in genome sizes among eukaryotes represents perhaps the greatest challenge for genomic holists. Although some researchers continue to characterize much variation in genome size as a mere by-product of an intragenomic selfish DNA "free-for-all," there is increasing evidence for the primacy of selection in molding genome sizes via impacts on cell size and division rates. Moreover, processes inducing quantum or doubling series variation in gametic or somatic genome sizes are common. These abrupt shifts have broad effects on phenotypic attributes at both cellular and organismal levels and may play an important role in explaining episodes of rapid (or even saltational) character state evolution.

Journal ArticleDOI
TL;DR: A data set of 77 genomic mouse/human gene pairs has been compiled from the EMBL nucleotide database, and their corresponding features determined, and a new alignment algorithm was developed to cope with the fact that large parts of noncoding sequences are not alignable in a meaningful way because of genetic drift.
Abstract: A data set of 77 genomic mouse/human gene pairs has been compiled from the EMBL nucleotide database, and their corresponding features determined. This set was used to analyze the degree of conservation of noncoding sequences between mouse and human. A new alignment algorithm was developed to cope with the fact that large parts of noncoding sequences are not alignable in a meaningful way because of genetic drift. This new algorithm, DNA Block Aligner (DBA), finds colinear-conserved blocks that are flanked by nonconserved sequences of varying lengths. The noncoding regions of the data set were aligned with DBA. The proportion of the noncoding regions covered by blocks >60% identical was 36% for upstream regions, 50% for 5' UTRs, 23% for introns, and 56% for 3' UTRs. These blocks of high identity were more or less evenly distributed across the length of the features, except for upstream regions in which the first 100 bp upstream of the transcription start site was covered in up to 70% of the gene pairs. This data set complements earlier sets on the basis of cDNA sequences and will be useful for further comparative studies. [This paper contains supplementary data that can be found at http://www.genome.org [corrected]].
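
The coverage figures quoted above (the proportion of a noncoding region covered by blocks more than 60% identical) amount to a simple interval computation once the blocks are known. The sketch below is not the DBA algorithm itself; the block tuple layout `(start, end, identity)` with half-open coordinates and the example numbers are assumptions for illustration.

```python
def covered_fraction(region_length, blocks, min_identity=0.60):
    """Fraction of a region covered by blocks whose identity exceeds
    min_identity. Blocks are (start, end, identity) with half-open
    coordinates; overlapping blocks are not counted twice."""
    kept = sorted((s, e) for s, e, identity in blocks if identity > min_identity)
    covered = 0
    current_end = 0
    for start, end in kept:
        start = max(start, current_end)   # skip the part already counted
        end = min(end, region_length)     # clip to the region
        if end > start:
            covered += end - start
            current_end = end
    return covered / region_length


# Hypothetical 200-bp intron with three candidate blocks.
blocks = [(0, 50, 0.80), (40, 80, 0.70), (100, 150, 0.50)]
fraction = covered_fraction(200, blocks)
```

Here the third block falls below the identity cutoff and the first two overlap by 10 bp, so 80 of the 200 positions are counted as covered.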

Journal ArticleDOI
TL;DR: Using both env and long terminal repeat (LTR) sequences, with maximal representation of genetic diversity within primate strains, this work revises and expands the unique evolutionary history of human and simian T-cell leukemia/lymphotropic viruses (HTLV/STLV).
Abstract: Using both env and long terminal repeat (LTR) sequences, with maximal representation of genetic diversity within primate strains, we revise and expand the unique evolutionary history of human and simian T-cell leukemia/lymphotropic viruses (HTLV/STLV). Based on the robust application of three different phylogenetic algorithms of minimum evolution-neighbor joining, maximum parsimony, and maximum likelihood, we address overall levels of genetic diversity, specific rates of mutation within and between different regions of the viral genome, relatedness among viral strains from geographically diverse regions, and estimation of the pattern of divergence of the virus into extant lineages. Despite broad genomic similarities, type I and type II viruses do not share concordant evolutionary histories. HTLV-I/STLV-I are united through distinct phylogeographic patterns, infection of 20 primate species, multiple episodes of interspecies transmission, and exhibition of a range in levels of genetic divergence. In contrast, type II viruses are isolated from only two species (Homo sapiens and Pan paniscus) and are paradoxically endemic to both Amerindian tribes of the New World and human Pygmy villagers in Africa. Furthermore, HTLV-II is spreading rapidly through new host populations of intravenous drug users. Despite such clearly disparate host populations, the resultant HTLV-II/STLV-II phylogeny exhibits little phylogeographic concordance and indicates low levels of transcontinental genetic differentiation. Together, these patterns generate a model of HTLV/STLV emergence marked by an ancient ancestry, differential rates of divergence, and continued global expansion.

Journal ArticleDOI
TL;DR: This comparative analysis identified pairs of zebrafish genes that appear to be orthologous to single mammalian genes, suggesting that these genes arose in a genome duplication that occurred in the teleost lineage after the divergence of fish and mammal ancestors.
Abstract: Genetic screens in zebrafish (Danio rerio) have isolated mutations in hundreds of genes with essential functions. To facilitate the identification of candidate genes for these mutations, we have genetically mapped 104 genes and expressed sequence tags by scoring single-strand conformational polymorphisms in a panel of haploid siblings. To integrate this map with existing genetic maps, we also scored 275 previously mapped genes, microsatellites, and sequence-tagged sites in the same haploid panel. Systematic phylogenetic analysis defined likely mammalian orthologs of mapped zebrafish genes, and comparison of map positions in zebrafish and mammals identified significant conservation of synteny. This comparative analysis also identified pairs of zebrafish genes that appear to be orthologous to single mammalian genes, suggesting that these genes arose in a genome duplication that occurred in the teleost lineage after the divergence of fish and mammal ancestors. This comparative map analysis will be useful in predicting the locations of zebrafish genes from mammalian gene maps and in understanding the evolution of the vertebrate genome.

Journal ArticleDOI
TL;DR: d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or "transitive closure" rules, is described, and its efficiency relative to other clustering tools is evaluated.
Abstract: Several efforts are under way to condense single-read expressed sequence tags (ESTs) and full-length transcript data on a large scale by means of clustering or assembly. One goal of these projects is the construction of gene indices where transcripts are partitioned into index classes (or clusters) such that they are put into the same index class if and only if they represent the same gene. Accurate gene indexing facilitates gene expression studies and inexpensive and early partial gene sequence discovery through the assembly of ESTs that are derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing. We describe d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or “transitive closure” rules. We then evaluate the relative efficiency of d2_cluster with respect to other clustering tools. UniGene is chosen for comparison because of its high quality and wide acceptance. It is shown that although d2_cluster and UniGene produce results that are between 83% and 90% identical, the joining rate of d2_cluster is between 8% and 20% greater than UniGene. Finally, we present the first published rigorous evaluation of under and over clustering (in other words, of type I and type II errors) of a sequence clustering algorithm, although the existence of highly identical gene paralogs means that care must be taken in the interpretation of the type II error. Upper bounds for these d2_cluster error rates are estimated at 0.4% and 0.8%, respectively. In other words, the sensitivity and selectivity of d2_cluster are estimated to be >99.6% and 99.2%. [Supplementary material to this paper may be found online at www.genome.org and at www.pangeasystems.com.]
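
The "transitive closure" rule mentioned in this abstract can be illustrated with a minimal sketch: if A is similar to B and B is similar to C, all three land in the same index class even when A and C share nothing directly. The similarity test below (a count of shared k-mers) and the names `kmers`, `similar`, and `transitive_closure_clusters` are assumptions made for illustration only; they are not the d2 word-based measure or the d2_cluster implementation.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similar(a, b, k=6, min_shared=3):
    """Hypothetical stand-in for a real similarity criterion:
    two sequences are linked if they share at least min_shared k-mers."""
    return len(kmers(a, k) & kmers(b, k)) >= min_shared


def transitive_closure_clusters(seqs, is_similar=similar):
    """Single-linkage (transitive closure) clustering via union-find.

    Returns a sorted list of clusters, each a list of indices into seqs.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the classes of every pair that passes the similarity test.
    for i, j in combinations(range(len(seqs)), 2):
        if is_similar(seqs[i], seqs[j]):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The all-pairs loop is quadratic in the number of sequences, so this sketch only conveys the linkage rule; it says nothing about how a production tool would make the comparison fast.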

Journal ArticleDOI
TL;DR: Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters.
Abstract: The expressed human genome is being sequenced and analyzed by disparate groups producing disparate data. The majority of the identified coding portion is in the form of expressed sequence tags (ESTs). The need to discover exonic representation and expression forms of full-length cDNAs for each human gene is frustrated by the partial and variable quality nature of this data delivery. A highly redundant human EST data set has been processed into integrated and unified expressed transcript indices that consist of hierarchically organized human transcript consensi reflecting gene expression forms and genetic polymorphism within an index class. The expression index and its intermediate outputs include cleaned transcript sequence, expression, and alignment information and a higher fidelity subset, SANIGENE. The STACK_PACK clustering system has been applied to dbEST release 121598 (GenBank version 110). Sixty-four percent of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK metaclusters consolidate a greater number of ESTs by a factor of 1.86 with respect to the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole body index cluster and three STACK v.2.3 tissue-level clusters. Statistics of a staggered release whole body index build of STACK v.2.0 are presented.

Journal ArticleDOI
TL;DR: The upstream region of this gene in a New World monkey, the marmoset, is sequenced to demonstrate the presence of an LCR in an equivalent position to that in Old World primates, and extensive homology from the coding region to the LCR with the upstream sequence of the human LW gene is shown.
Abstract: Trichromacy in all Old World primates is dependent on separate X-linked MW and LW opsin genes that are organized into a head-to-tail tandem array flanked on the upstream side by a locus control region (LCR). The 5' regions of these two genes show homology for only the first 236 bp, although within this region, the differences are conserved in humans, chimpanzees, and two species of cercopithecoid monkeys. In contrast, most New World primates have only a single polymorphic X-linked opsin gene; all males are dichromats and trichromacy is achieved only in those females that possess a different form of this gene on each X chromosome. By sequencing the upstream region of this gene in a New World monkey, the marmoset, we have been able to demonstrate the presence of an LCR in an equivalent position to that in Old World primates. Moreover, the marmoset sequence shows extensive homology from the coding region to the LCR with the upstream sequence of the human LW gene, a distance of >3 kb, whereas homology with the human MW gene is again limited to the first 236 bp, indicating that the divergent MW sequence identifies the site of insertion of the duplicated gene. This is further supported by the presence of an incomplete Alu element on the upstream side of this insertion point in the MW gene of both humans and a cercopithecoid monkey, with additional Alu elements present further upstream. Therefore, these Alu elements may have been involved in the initial gene duplication and may also be responsible for the high frequency of gene loss and gene duplication within the opsin gene array. Full trichromacy is present in one species of New World monkey, the howler monkey, in which separate MW and LW genes are again present. 
In contrast to the separate genes in humans, however, the upstream sequences of the two howler genes show homology with the marmoset for at least 600 bp, which is well beyond the point of divergence of the human MW and LW genes, and each sequence is associated with a different LCR, indicating that the duplication in the howler monkey involved the entire upstream region. [The sequence data described in this paper have been submitted to GenBank under accession nos. AF155218, AF156715, and AF156716.]

Journal ArticleDOI
TL;DR: The nematode Caenorhabditis elegans is the first animal whose genome is completely sequenced, providing a rich source of gene information relevant to metazoan biology and human disease, which permits a broad-based gene inactivation approach in C. elegans to be scaled up.
Abstract: The rapid progress in genome sequencing projects has propelled a shift in genome analysis from structural genomics to functional genomics, that is, the genome-driven systematic study of gene function (Hieter and Boguski 1997). In this postgenomic era, model organisms in which large-scale functional analyses and rapid genetic experiments are possible (chiefly the yeast Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, and the nematode Caenorhabditis elegans) are increasingly useful in understanding human disease pathways (Miklos and Rubin 1996; Oliver 1996; Ahringer 1997). With the recent completion of the C. elegans genomic sequence (The C. elegans Sequencing Consortium 1998), the first complete animal genome is now available. Computational analysis of partial C. elegans genome data has revealed that many positionally cloned human disease genes have C. elegans orthologs (genes encoding proteins with similar multidomain architecture and predicted function) (Mushegian et al. 1997, 1998). The C. elegans genome also contains a significant proportion of apparently nematode-specific protein families that may be relevant for nematode biology and parasitism (Sonnhammer and Durbin 1997; Blaxter 1998). A rapid method to ascertain gene function by targeted gene inactivation in C. elegans would be highly desirable, but homologous recombination as in the mouse has not yet proven feasible (Plasterk 1995). Microinjection of target-specific RNA, for reasons not completely understood, is a remarkably effective means of transcriptional interruption in this small organism (Fire et al. 1998). However, the extent of RNA-mediated inactivation is difficult to assess in many cases, and such inactivation obviously does not produce a germ-line lesion necessary for genetic crosses, suppressor screens, and other longer-term genetic manipulations. 
One large-scale approach to germ-line inactivation is to induce random mutations in the animal population, followed by screening for mutations in a target gene of known sequence. The most well-developed method of this so-called target-selected gene inactivation in C. elegans has used random transposon Tc1 insertions to generate a collection of mutants, followed by PCR screening for the presence of Tc1 in a gene of choice (Zwaal et al. 1993). However, because Tc1 insertion alone does not usually result in gene inactivation, it is necessary to subsequently screen individual Tc1 alleles for animals in which the transposon and flanking DNA have been deleted through transposon excision, a relatively infrequent event (Plasterk 1995). An alternative approach is to use chemical mutagens to directly induce deletions in a population, and then to screen by PCR for deleted segments within a selected target region (Yandell et al. 1994). Jansen et al. (1997, 1999) have established the broader feasibility of this approach by isolating mutants of the C. elegans heterotrimeric G protein gene family. In this study we describe our results using random chemical mutagenesis and PCR screening to rapidly isolate deletion mutations in a large number of genes encoding proteins with a broad range of functions. C. elegans is unique among model animal species in that it can be grown in liquid cultures and also can be stored frozen but viable at −80°C. We have taken advantage of these properties to devise a rapid and scalable procedure for gene disruption almost entirely on the basis of microtiter plate arrays of whole animals and genomic DNA. We used four different chemical agents to create mutagenized libraries, and found that all four mutagens induce detectable deletions. Almost all of the deletions were significant enough to result in loss of exons, frame shifts, and other molecular lesions likely to cause loss of gene function. 
We discuss the sensitivity, specificity, limitations, and broader utility of this approach to systematic gene inactivation in C. elegans.

Journal ArticleDOI
TL;DR: The construction and integration of two genomic maps are reported: a dense genetic linkage map of the rat and the first radiation hybrid (RH) map of the rat, which provide the basic tools for rat genomics.
Abstract: The laboratory rat (Rattus norvegicus) is a key animal model for biomedical research. However, the genetic infrastructure required for connecting phenotype and genotype in the rat is currently incomplete. Here, we report the construction and integration of two genomic maps: a dense genetic linkage map of the rat and the first radiation hybrid (RH) map of the rat. The genetic map was constructed in two F2 intercrosses (SHRSP × BN and FHH × ACI), containing a total of 4736 simple sequence length polymorphism (SSLP) markers. Allele sizes for 4328 of the genetic markers were characterized in 48 of the most commonly used inbred strains. The RH map is a lod ≥ 3 framework map, including 983 SSLPs, thereby allowing integration with markers on various genetic maps and with markers mapped on the RH panel. Together, the maps provide an integrated reference to >3000 genes and ESTs and >8500 genetic markers (5211 of our SSLPs and >3500 SSLPs developed by other groups). [Bihoreau et al. (1997); James and Tanigami, RHdb (http://www.ebi.ac.uk/RHdb/index.html); Wilder (http://www.nih.gov/niams/scientific/ratgbase); Serikawa et al. (1992); RATMAP server (http://ratmap.gen.gu.se)] RH maps (v. 2.0) have been posted on our web sites at http://goliath.ifrc.mcw.edu/LGR/index.html or http://curatools.curagen.com/ratmap. Both web sites provide an RH mapping server where investigators can localize their own RH vectors relative to this map. The raw data have been deposited in the RHdb database. The RH map provides the means to rapidly localize genetic markers, genes, and ESTs within the rat genome. These maps provide the basic tools for rat genomics. They will facilitate studies of multifactorial disease and functional genomics, allow construction of physical maps, and provide a scaffold for both directed and large-scale sequencing efforts and comparative genomics in this important experimental organism.

Journal ArticleDOI
TL;DR: This Insight/Outlook piece focuses on the role that the rat, the most widely studied experimental animal model for biomedical research, will play in annotating the genome in the functional genomics era.
Abstract: The 20th Century has seen a remarkable number of inventions and technological advances in virtually all aspects of human life and health care. Many areas of biomedical research have made great strides in unraveling the cause of human disease and in developing new therapies to counter, or at least improve, outcome from disease. However, the cause of the vast majority of common disease remains poorly defined. In the final year of the millennium, the release of the draft sequence of the human genome promises to bring in a new era for basic science research and, hopefully, unprecedented growth in our understanding of human disease. For this to occur there is a critical need to annotate the genomic sequence with gene function and basic biology. Typically, the view from the geneticist immediately turns to mouse, as the mammalian contributor. Yet, not all biologists are willing to convert to the mouse as their system of choice, in many cases because of the existence of better models. Although the mouse is undoubtedly going to play a major role in contributing to the annotation of gene function, other mammalian species will also make significant contributions. This Insight/Outlook piece focuses on the role the rat will play in annotating the genome in the functional genomics era. The laboratory rat, Rattus norvegicus, was the first mammalian species domesticated for scientific research, with work dating back to before 1850 (Lindsey 1979). From this auspicious beginning, the rat has become the most widely studied experimental animal model for biomedical research. Since 1966 (the earliest year covered by the Medline database), nearly 500,000 research articles reporting the use of rats have been published, most focused on evaluating the biology and/or the pathobiology of the rat. 
In contrast to its central role in the study of behavior, biochemistry, neurobiology, physiology, and pharmacology, the rat has lagged far behind the mouse as a genetic “model” organism, until recently. Historically, rat genetics had a surprisingly early start. The first genetic studies were carried out by Crampe from 1877 to 1885 and focused on the inheritance of coat color (Lindsey 1979). Hugo De Vries, Karl Correns, and Erich Tschermak rediscovered Mendel’s laws at the turn of the century, and Bateson used these concepts in 1903 to demonstrate that rat coat color is a Mendelian trait (Lindsey 1979). The first rat inbred strain, PA, was established by King in 1909—the same year that inbreeding began for the first inbred strain of mouse, DBA (Lindsey 1979). Despite this parallel start, the mouse soon became the model of choice for mammalian geneticists, whereas the rat became the model of choice for physiologists, nutritionists, and other biomedical researchers. Geneticists preferred the mouse because of its smaller size, which simplified housing requirements, and the availability of many coat color and other mutants exhibiting Mendelian patterns of inheritance, which had been collected by mouse fanciers (Nishioka 1995). Physiologists and other biomedical researchers favored the rat because its larger size facilitated experimental interventions. Over time a large number of rat strains were used to develop disease models by selective breeding, which “fixes” natural disease alleles in particular strains or colonies (Greenhouse et al. 1990). 
For example, there are inbred strains of rats used for research in the following areas: addiction, aging, anatomy, autoimmune diseases, behavior, blood diseases, breast cancer, cardiovascular diseases, cancer, comparative genomics, dental diseases, diseases of the skin and hair, endocrinology, eye disorders, growth and reproduction, hematologic disorders, histology, kidney diseases, metabolic disorders, neurological and neuromuscular diseases, nutrition, pathophysiology, pharmacology, pulmonary diseases, physiology, reproductive disorders, skeletal disorders, sleep apnea, transplantation and immunogenetics, toxicology, and urological disorders (Gill et al. 1989; Greenhouse et al. 1990; James and Lindpaintner 1997).

Journal ArticleDOI
TL;DR: The nuclear receptor (NR) superfamily is the most abundant class of transcriptional regulators encoded in the Caenorhabditis elegans genome, with >200 predicted genes revealed by the screens and analysis of genomic sequence reported here.
Abstract: The nuclear receptor (NR) superfamily is the most abundant class of transcriptional regulators encoded in the Caenorhabditis elegans genome, with >200 predicted genes revealed by the screens and analysis of genomic sequence reported here. This is the largest number of NR genes yet described from a single species, although our analysis of available genomic sequence from the related nematode Caenorhabditis briggsae indicates that it also has a large number. Existing data demonstrate expression for 25% of the C. elegans NR sequences. Sequence conservation and statistical arguments suggest that the majority represent functional genes. An analysis of these genes based on the DNA-binding domain motif revealed that several NR classes conserved in both vertebrates and insects are also represented among the nematode genes, consistent with the existence of ancient NR classes shared among most, and perhaps all, metazoans. Most of the nematode NR sequences, however, are distinct from those currently known in other phyla, and reveal a previously unobserved diversity within the NR superfamily. In C. elegans, extensive proliferation and diversification of NR sequences have occurred on chromosome V, accounting for > 50% of the predicted NR genes.

Journal ArticleDOI
TL;DR: Of the 40,000 most-abundant human genes, these 8 are the most closely linked to the known diagnostic genes, and thus are prime targets for pharmaceutical research.
Abstract: We wish to identify genes associated with disease. To do so, we look for novel genes whose expression patterns mimic those of known disease-associated genes, using a method we call Guilt-by-Association (GBA), on the basis of a combinatoric measure of association. Using GBA, we have examined the expression of 40,000 human genes in 522 cDNA libraries, and have discovered several hundred previously unidentified genes associated with cancer, inflammation, steroid-synthesis, insulin-synthesis, neurotransmitter processing, matrix remodeling, and other disease processes. The majority of the genes thus discovered show no sequence similarity to known genes, and thus could not have been identified by homology searches. We present here an example of the discovery of eight genes associated with prostate cancer. Of the 40,000 most-abundant human genes, these 8 are the most closely linked to the known diagnostic genes, and thus are prime targets for pharmaceutical research. [The sequence data described in this paper have been submitted to the GenBank data library under accession nos. AF109298–AF109303.]
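
The abstract does not spell out its "combinatoric measure of association." As a hedged illustration only, one measure of that general kind is a hypergeometric tail probability: given how many libraries each of two genes appears in, how likely is it that they would co-occur in at least the observed number of libraries by chance? The function names and the presence/absence input format below are assumptions for this sketch and are not taken from the paper.

```python
from math import comb


def cooccurrence_p_value(n_libraries, n_a, n_b, n_both):
    """Probability that two genes, present in n_a and n_b of n_libraries
    libraries respectively, co-occur in at least n_both libraries by chance
    (upper tail of the hypergeometric distribution)."""
    total = comb(n_libraries, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_libraries - n_a, n_b - x)
        for x in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / total


def rank_by_association(presence, known_gene, n_libraries):
    """Rank candidate genes by how improbable their co-occurrence with
    known_gene is. presence maps gene name -> set of library ids."""
    known_libs = presence[known_gene]
    scored = []
    for gene, libs in presence.items():
        if gene == known_gene:
            continue
        p = cooccurrence_p_value(
            n_libraries, len(known_libs), len(libs), len(known_libs & libs)
        )
        scored.append((p, gene))
    # Smallest p-value first: strongest association with the known gene.
    return [gene for _, gene in sorted(scored)]
```

A small p-value means the shared expression pattern is unlikely under random placement, which is the sense in which a candidate would be "guilty by association" in this sketch.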

Journal ArticleDOI
TL;DR: A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program and a collection of 1193 position-dependent weight matrices that can be used as fold identifiers was produced.
Abstract: A sensitive protein-fold recognition procedure was developed on the basis of iterative database search using the PSI-BLAST program. A collection of 1193 position-dependent weight matrices that can be used as fold identifiers was produced. In the completely sequenced genomes, folds could be automatically identified for 20%-30% of the proteins, with 3%-6% more detectable by additional analysis of conserved motifs. The distribution of the most common folds is very similar in bacteria and archaea but distinct in eukaryotes. Within the bacteria, this distribution differs between parasitic and free-living species. In all analyzed genomes, the P-loop NTPases are the most abundant fold. In bacteria and archaea, the next most common folds are ferredoxin-like domains, TIM-barrels, and methyltransferases, whereas in eukaryotes, the second to fourth places belong to protein kinases, beta-propellers and TIM-barrels. The observed diversity of protein folds in different proteomes is approximately twice as high as it would be expected from a simple stochastic model describing a proteome as a finite sample from an infinite pool of proteins with an exponential distribution of the fold fractions. Distribution of the number of domains with different folds in one protein fits the geometric model, which is compatible with the evolution of multidomain proteins by random combination of domains. [Fold predictions for proteins from 14 proteomes are available on the World Wide Web at. The FIDs are available by anonymous ftp at the same location.]

Journal ArticleDOI
TL;DR: In this survey, three recent experiments related to transcriptional regulation are reviewed and the great challenge for computational biologists trying to extract functional information from large-scale gene expression data is discussed.
Abstract: The use of high-density DNA arrays to monitor gene expression at a genome-wide scale constitutes a fundamental advance in biology. In particular, the expression pattern of all genes in Saccharomyces cerevisiae can be interrogated using microarray analysis where cDNAs are hybridized to an array of each of the approximately 6000 genes in the yeast genome. In this survey I review three recent experiments related to transcriptional regulation and discuss the great challenge for computational biologists trying to extract functional information from such large-scale gene expression data.

Journal ArticleDOI
TL;DR: A flurry of candidate gene-cloning experiments has shown the zebrafish to be an attractive organism for developmental biology, including studies of embryonic axis formation and early neurogenesis, and recent findings further establish the case that some zebrafish mutants will represent human diseases.
Abstract: Picture this: you have just mapped a human disease locus to a particular region of a chromosome. With a click of a computer button, the region of chromosomal synteny in the zebrafish (Danio rerio) genome is revealed. Behold, there are several mutant zebrafish loci mapped in this general region of synteny. Another click and you find a fish mutant resembling your human disease. Further clicking reveals several independent alleles with varying phenotypes establishing the pathophysiology of the human disease. Does this sound farfetched? Well, recently several zebrafish mutants with “human” diseases have been found. With more infrastructure for the zebrafish system, the above scenario could become commonplace. The zebrafish is an excellent system for developmental biologists and geneticists (Westerfield 1989; Detrich et al. 1999). The externally developing embryos are clear, allowing visualization of organ systems. The 1-inch size of the zebrafish allows large numbers of these vertebrates to be maintained in a relatively small space. In addition, each female lays >200 eggs per week. This enables the study of large numbers of meioses for positional cloning purposes. The genetic map has been continually improving over the past 2 years, and currently >2000 microsatellite markers and up to 400 genes have been defined (Knapik et al. 1998; Postlethwait et al. 1998) for the 1.7 × 10^9-bp genome (M. Fishman and J. Postlethwait, unpubl.). The zebrafish system was originally envisioned to provide important clues to normal embryogenesis and organ development. Because it is a vertebrate, the organism would bridge the gap between Drosophila/Caenorhabditis elegans and mouse/human genetics. A flurry of candidate gene-cloning experiments revealed that the organism is an attractive one for developmental biology and clearly demonstrates the use of zebrafish for establishing embryonic axis and early neurogenesis (Solnica-Krezel 1999). 
Another hope for the system was that the vertebrate zebrafish would relate to the human, and mutants could define disease loci. Positional cloning approaches in the zebrafish have been made possible by the development of key reagents such as YAC, PAC, and BAC libraries (Amemiya et al. 1999), as well as radiation hybrid panels (Kwok et al. 1998; M. Ekker, unpubl.). The first positional cloning project involved the isolation of the one-eyed-pinhead gene (Zhang et al. 1998), a novel cell surface molecule with EGF repeats. The second positional cloning project involved the isolation of the gene sauternes (sau) (Brownlie et al. 1998). Sau mutants have a normal number of blood cells circulating on day 2, but these blood cells fail to make hemoglobin. This mutant phenotype proved to be due to a defect in the erythroid synthase δ-aminolevulinate synthase (ALAS-2) gene, which regulates the first step in heme biosynthesis in embryonic red cells. Human patients with ALAS-2 mutations have a disease very similar to the fish called congenital sideroblastic anemia, establishing this zebrafish mutant as the first animal model of this human disease (see Fig. 1). Additionally, Shuo Lin and coworkers have provided evidence that the yquem (yqe) mutant is due to uroporphyrinogen decarboxylase (UROD) deficiency (Wang et al. 1998). This fish has the equivalent of human porphyria and further establishes the case that some of the zebrafish mutants will represent human diseases. Are the blood mutants unique among the zebrafish as to their relevance to human disease? Clearly, there are other phenotypes among all the zebrafish mutants that resemble human disorders (Driever and Fishman 1996). For instance, the zebrafish gridlock mutant has a defect similar to coarctation of the aorta in humans (Weinstein et al. 1995). In addition, there are zebrafish mutants with cystic kidneys that may represent polycystic kidney disease of humans (Drummond et al. 1998). 
It remains for clinicians to examine the zebrafish issue of Development (1996) to see whether other phenotypes resemble interesting diseases. It was known previously that the mouse and human genomes share large blocks of chromosomal synteny, but no one believed that the fish chromosomal structure would resemble that of the human. For many chromosomal loci, the synteny is obvious between the fish and the human (Postlethwait et al. 1998). This facilitates positional cloning of the zebrafish genes, which can utilize information from the Human Genome Project. A zebrafish researcher can scour the human databases and look for candidate genes in the region near a zebrafish mutation. In the future, it should be possible for investigators studying human genetics to be able to interface directly to a zebrafish Web site (The Zebrafish Server, The Fish Net, ZFIN, http://zfish.uoregon.edu/) and evaluate mutants in a region of interest to the investigator. This process of “genome ping-ponging” based on these syntenic relationships will further establish the usefulness of the zebrafish for understanding human disease. The article by Davidson et al. in this issue demonstrates the power of zebrafish to examine conserved genes and genome structure among the vertebrates. The GDF genes encode critical