
Showing papers in "Genome Research in 1998"


Journal ArticleDOI
TL;DR: This article describes phred, a base-calling program for automated sequencer traces with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined, independent of position in read, machine running conditions, or sequencing chemistry.
Abstract: The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.

7,627 citations


Journal ArticleDOI
TL;DR: The ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data, is developed and implemented in the base-calling program phred.
Abstract: Elimination of the data processing bottleneck in high-throughput sequencing will require both improved accuracy of data processing software and reliable measures of that accuracy. We have developed and implemented in our base-calling program phred the ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data. These error probabilities are shown here to be valid (correspond to actual error rates) and to have high power to discriminate correct base-calls from incorrect ones, for read data collected under several different chemistries and electrophoretic conditions. They play a critical role in our assembly program phrap and our finishing program consed.

5,334 citations
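
The per-base error probabilities described in the entry above are conventionally reported as quality scores. The abstract does not give a formula; the sketch below assumes the widely used convention Q = -10 log10(P), and the function names are illustrative rather than part of phred itself.

```python
import math


def error_prob_to_quality(p_error: float) -> int:
    """Convert a per-base error probability to a phred-style quality score.

    Assumes the convention Q = -10 * log10(P), rounded to the nearest integer.
    """
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return round(-10.0 * math.log10(p_error))


def quality_to_error_prob(quality: float) -> float:
    """Invert the convention: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given its quality scores."""
    return sum(quality_to_error_prob(q) for q in qualities)
```

Under this convention a score of 20 corresponds to a 1-in-100 chance of a miscall and a score of 30 to 1 in 1000, so summing the per-base probabilities gives the expected number of errors in a read.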


Journal ArticleDOI
TL;DR: A finishing tool, consed, is described that aims to reduce the need for human intervention in finishing and to make it as efficient as possible, using error probabilities from phred and phrap as an objective criterion to guide the entire finishing process.
Abstract: Sequencing of large clones or small genomes is generally done by the shotgun approach (Anderson et al. 1982). This has two phases: (1) a shotgun phase in which a number of reads are generated from random subclones and assembled into contigs, followed by (2) a directed, or finishing phase in which the assembly is inspected for correctness and for various kinds of data anomalies (such as contaminant reads, unremoved vector sequence, and chimeric or deleted reads), additional data are collected to close gaps and resolve low quality regions, and editing is performed to correct assembly or base-calling errors. Finishing is currently a bottleneck in large-scale sequencing efforts, and throughput gains will depend both on reducing the need for human intervention and making it as efficient as possible. We have developed a finishing tool, consed, which attempts to implement these principles. A distinguishing feature relative to other programs is the use of error probabilities from our programs phred and phrap as an objective criterion to guide the entire finishing process. More information is available at http://www.genome.washington.edu/consed/consed.html.

3,486 citations


Journal ArticleDOI
TL;DR: A new model of R gene evolution, adapted and expanded from one proposed for vertebrate major histocompatibility complex and immunoglobulin gene families, is proposed; it emphasizes divergent selection acting on arrays of solvent-exposed residues in the LRR, resulting in evolution of individual R genes within a haplotype.
Abstract: Classical genetic and molecular data show that genes determining disease resistance in plants are frequently clustered in the genome. Genes for resistance (R genes) to diverse pathogens cloned from several species encode proteins that have motifs in common. These motifs indicate that R genes are part of signal-transduction systems. Most of these R genes encode a leucine-rich repeat (LRR) region. Sequences encoding putative solvent-exposed residues in this region are hypervariable and have elevated ratios of nonsynonymous to synonymous substitutions; this suggests that they have evolved to detect variation in pathogen-derived ligands. Generation of new resistance specificities previously had been thought to involve frequent unequal crossing-over and gene conversions. However, comparisons between resistance haplotypes reveal that orthologs are more similar than paralogs implying a low rate of sequence homogenization from unequal crossing-over and gene conversion. We propose a new model adapted and expanded from one proposed for the evolution of vertebrate major histocompatibility complex and immunoglobulin gene families. Our model emphasizes divergent selection acting on arrays of solvent-exposed residues in the LRR resulting in evolution of individual R genes within a haplotype. Intergenic unequal crossing-over and gene conversions are important but are not the primary mechanisms generating variation.

1,022 citations


Journal ArticleDOI
TL;DR: A large number of mapped SNPs will be valuable as markers throughout the genome for finding SNPs that do affect gene function, as linkage disequilibrium over tens to hundreds of kilobases is expected to be found in many regions of the human genome.
Abstract: perform association analysis on many affected and unaffected individuals, which would require hundreds of thousands of variants spread over the entire genome (Risch and Merikangas 1996). Such a large number of variants is currently not available. The DNA Polymorphism Discovery Resource is designed to promote their discovery. About 90% of sequence variants in humans are differences in single bases of DNA, called single nucleotide polymorphisms (SNPs). SNPs in the coding regions of genes (cSNPs) or in regulatory regions are more likely to cause functional differences than SNPs elsewhere. Although most SNPs do not affect gene function, a large number of mapped SNPs will be valuable as markers throughout the genome for finding SNPs that do affect gene function, as linkage disequilibrium over tens to hundreds of kilobases is expected to be found in many regions of the human genome. Both SNPs and cSNPs can be identified by using the DNA Polymorphism Discovery Resource. When two random chromosomes are

836 citations


Journal ArticleDOI
TL;DR: A freely available computer program solves the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors.
Abstract: We address the problem of efficiently aligning a transcribed and spliced DNA sequence with a genomic sequence containing that gene, allowing for introns in the genomic sequence and a relatively small number of sequencing errors. A freely available computer program, described herein, solves the problem for a 100-kb genomic sequence in a few seconds on a workstation.

764 citations


Journal ArticleDOI
TL;DR: It is suggested that functional predictions can be greatly improved by focusing on how the genes became similar in sequence (i.e., evolution) rather than on the sequence similarity itself.
Abstract: The ability to accurately predict gene function based on gene sequence is an important tool in many areas of biological research. Such predictions have become particularly important in the genomics age in which numerous gene sequences are generated with little or no accompanying experimentally determined functional information. Almost all functional prediction methods rely on the identification, characterization, and quantification of sequence similarity between the gene of interest and genes for which functional information is available. Because sequence is the prime determining factor of function, sequence similarity is taken to imply similarity of function. There is no doubt that this assumption is valid in most cases. However, sequence similarity does not ensure identical functions, and it is common for groups of genes that are similar in sequence to have diverse (although usually related) functions. Therefore, the identification of sequence similarity is frequently not enough to assign a predicted function to an uncharacterized gene; one must have a method of choosing among similar genes with different functions. In such cases, most functional prediction methods assign likely functions by quantifying the levels of similarity among genes. I suggest that functional predictions can be greatly improved by focusing on how the genes became similar in sequence (i.e., evolution) rather than on the sequence similarity itself. It is well established that many aspects of comparative biology can benefit from evolutionary studies (Felsenstein 1985), and comparative molecular biology is no exception (e.g., Altschul et al. 1989; Goldman et al. 1996). In this commentary, I discuss the use of evolutionary information in the prediction of gene function. 
To appreciate the potential of a phylogenomic approach to the prediction of gene function, it is necessary to first discuss how gene sequence is commonly used to predict gene function and some general features about gene evolution.

608 citations


Journal ArticleDOI
TL;DR: A genome-wide survey of Saccharomyces cerevisiae retrotransposons offers the first opportunity to view organizational and evolutionary trends among retrotransposons at the genome level; the authors hope the compiled data will serve as a starting point for further investigation and for comparison to other, more complex genomes.
Abstract: We conducted a genome-wide survey of Saccharomyces cerevisiae retrotransposons and identified a total of 331 insertions, including 217 Ty1, 34 Ty2, 41 Ty3, 32 Ty4, and 7 Ty5 elements. Eighty-five percent of insertions were solo long terminal repeats (LTRs) or LTR fragments. Overall, retrotransposon sequences constitute >377 kb or 3.1% of the genome. Independent evolution of retrotransposon sequences was evidenced by the identification of a single-base pair insertion/deletion that distinguishes the highly similar Ty1 and Ty2 LTRs and the identification of a distinct Ty1 subfamily (Ty1'). Whereas Ty1, Ty2, and Ty5 LTRs displayed a broad range of sequence diversity (typically ranging from 70%-99% identity), Ty3 and Ty4 LTRs were highly similar within each element family (most sharing >96% nucleotide identity). Therefore, Ty3 and Ty4 may be more recent additions to the S. cerevisiae genome and perhaps entered through horizontal transfer or past polyploidization events. Distribution of Ty elements is distinctly nonrandom: 90% of Ty1, 82% of Ty2, 95% of Ty3, and 88% of Ty4 insertions were found within 750 bases of tRNA genes or other genes transcribed by RNA polymerase III. tRNA genes are the principal determinant of retrotransposon distribution, and there are, on average, 1.2 insertions per tRNA gene. Evidence for recombination was found near many Ty elements, particularly those not associated with tRNA gene targets. For these insertions, 5'- and 3'-flanking sequences were often duplicated and rearranged among multiple chromosomes, indicating that recombination between retrotransposons can influence genome organization. S. cerevisiae offers the first opportunity to view organizational and evolutionary trends among retrotransposons at the genome level, and we hope our compiled data will serve as a starting point for further investigation and for comparison to other, more complex genomes.

539 citations


Journal ArticleDOI
TL;DR: It is demonstrated that for sibships with parents, only the parents require individual genotyping to derive the TDT statistic, whereas all the offspring can be pooled, which can potentially lead to considerable savings in genotyping, especially for multiplex sibships.
Abstract: We consider statistics for analyzing a variety of family-based and nonfamily-based designs for detecting linkage disequilibrium of a marker with a disease susceptibility locus. These designs include sibships with parents, sibships without parents, and use of unrelated controls. We also provide formulas for and evaluate the relative power of different study designs using these statistics. In this first paper in the series, we derive statistical tests based on data derived from DNA pooling experiments and describe their characteristics. Although designs based on affected and unaffected sibs without parents are usually robust to population stratification, they suffer a loss of power compared with designs using parents or unrelateds as controls. Although increasing the number of unaffected sibs improves power, the increase is generally not substantial. Designs including sibships with multiple affected sibs are typically the most powerful, with any of these control groups, when the disease allele frequency is low. When the allele frequency is high, however, designs with unaffected sibs as controls do not retain this advantage. In designs with parents, having an affected parent has little impact on the power, except for rare dominant alleles, where the power is increased compared with families with no affected parents. Finally, we also demonstrate that for sibships with parents, only the parents require individual genotyping to derive the TDT statistic, whereas all the offspring can be pooled. This can potentially lead to considerable savings in genotyping, especially for multiplex sibships. The formulas and tables we derive should provide some guidance to investigators designing nuclear family-based linkage disequilibrium studies for complex diseases.

399 citations
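
For readers unfamiliar with the TDT statistic mentioned in the entry above, the following is a minimal sketch of the classic transmission/disequilibrium test computed from individually genotyped trios. It is not the pooled-DNA version derived in the paper; the function names and the example counts are assumptions made for illustration.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Classic TDT statistic (a McNemar-type test).

    `transmitted` counts heterozygous parents who passed the candidate
    allele to an affected child; `untransmitted` counts those who passed
    the other allele.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


def chi2_1df_pvalue(statistic: float) -> float:
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
stat = tdt_statistic(60, 40)
p_value = chi2_1df_pvalue(stat)
```

Under the null hypothesis of no linkage or association, each heterozygous parent transmits either allele with probability 1/2, which is why only the discordant counts enter the statistic.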


Journal ArticleDOI
TL;DR: A new sequence pattern discovery algorithm is developed that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences.
Abstract: We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm in two cases, (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles. In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the genome overall. In the second case, first we clustered the upstream regions of all the genes by similarity of their expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns that are over-represented in each cluster. In both cases we considered each pattern that occurred at least in some minimum number of sequences, and rated them on the basis of their over-representation. Among the highest rating patterns, most have matches to substrings in known yeast transcription factor-binding sites. Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters. Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by chance.

368 citations
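
The idea of rating a pattern by its over-representation, as described in the entry above, can be sketched as follows. This is not the authors' exhaustive discovery algorithm: it only scores one given regular expression, and the binomial scoring, the add-one smoothing, and all names are assumptions chosen for illustration.

```python
import math
import re


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def overrepresentation(pattern: str, target, background):
    """Score how surprising a pattern's frequency is in `target`.

    Returns (number of target sequences matched, binomial tail probability),
    where the per-sequence match probability is estimated from `background`.
    """
    rx = re.compile(pattern)
    hits_target = sum(1 for seq in target if rx.search(seq))
    hits_background = sum(1 for seq in background if rx.search(seq))
    # Add-one smoothing keeps the estimated probability strictly inside (0, 1).
    p_background = (hits_background + 1) / (len(background) + 2)
    return hits_target, binomial_tail(hits_target, len(target), p_background)


# Toy data (hypothetical): three upstream regions sharing a CACGTG motif.
upstream = ["TTCACGTGAA", "CACGTGTTGC", "GGACACGTGT"]
genome_sample = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT",
                 "ACACACAC", "TGTGTGTG", "AGAGAGAG", "CTCTCTCT"]
hits, p_value = overrepresentation("CACGTG", upstream, genome_sample)
```

A smaller tail probability means the pattern occurs in more target sequences than the background rate would predict; a full discovery method would enumerate candidate patterns and rank them by such a score.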


Journal ArticleDOI
TL;DR: This article surveys methods for single-nucleotide polymorphism analysis.
Abstract: Reading bits of genetic information: Methods for single-nucleotide polymorphism analysis.

Journal ArticleDOI
TL;DR: The present review offers a synopsis of the current state of knowledge of all synuclein family members in different species.
Abstract: The synuclein gene family recently came into the spotlight, when one of its members, alpha-synuclein, was found to be mutated in several families with autosomal dominant Parkinson's disease (PD). A peptide of the alpha-synuclein protein had been characterized previously as a major component of amyloid plaques in brains of patients with Alzheimer's disease (AD). The mechanism by which this presynaptic protein is involved in the two most common neurodegenerative disorders, AD and PD, remains unclear. Remarkably, another member of this gene family, gamma-synuclein, has been shown to be overexpressed in breast carcinomas and may also be overexpressed in ovarian cancer. The possible involvement of the synuclein proteins in the etiology of common human diseases has raised exciting questions and is the subject of intense investigation. Details of the properties of any member of the synuclein family may provide useful information for understanding the characteristics and function of other family members. The present review offers a synopsis of the current state of knowledge of all synuclein family members in different species.

Journal ArticleDOI
TL;DR: The mathematical models used to describe patterns of DNA base substitution and amino acid replacement are discussed, including the analysis of observed DNA base and amino acid mutation patterns, the concept of site heterogeneity, and the incorporation of structural biology data, all of which have become particularly important in recent years.
Abstract: Phylogenetic reconstruction is a fast-growing field that is enriched by different statistical approaches and by findings and applications in a broad range of biological areas. Fundamental to these are the mathematical models used to describe the patterns of DNA base substitution and amino acid replacement. These may become some of the basic models for comparative genome research. We discuss these models, including the analysis of observed DNA base and amino acid mutation patterns, the concept of site heterogeneity, and the incorporation of structural biology data, all of which have become particularly important in recent years. We also describe the use of such models in phylogenetic reconstruction and statistical methods for the comparison of different models.
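
As a concrete instance of a DNA base substitution model, the sketch below uses the Jukes-Cantor (JC69) correction, the simplest such model. The abstract above does not single out any particular model, so the choice of JC69, the function names, and the example sequences are assumptions for illustration only.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p).

    Corrects the observed proportion of differences for multiple
    substitutions at the same site, assuming equal base frequencies
    and equal rates for all substitution types.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


observed = p_distance("ACGTACGT", "ACGTACGA")  # one difference in eight sites
corrected = jc69_distance(observed)
```

The corrected distance always exceeds the observed proportion of differences, and the gap widens as sequences diverge; richer models relax the equal-rate and equal-frequency assumptions.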

Journal ArticleDOI
TL;DR: From the phylogenetic distribution of the proteins encoded by the completely sequenced bacterial and archaeal genomes, the existence of an ancestral protein kinase prior to the divergence of eukaryotes, bacteria, and archaea is inferred.
Abstract: The central role of serine/threonine and tyrosine protein kinases in signal transduction and cellular regulation in eukaryotes is well established and widely documented. Considerably less is known about the prevalence and role of these protein kinases in bacteria and archaea. In order to examine the evolutionary origins of the eukaryotic-type protein kinase (ePK) superfamily, we conducted an extensive analysis of the proteins encoded by the completely sequenced bacterial and archaeal genomes. We detected five distinct families of known and predicted putative protein kinases with representatives in bacteria and archaea that share a common ancestry with the eukaryotic protein kinases. Four of these protein families have not been identified previously as protein kinases. From the phylogenetic distribution of these families, we infer the existence of an ancestral protein kinase(s) prior to the divergence of eukaryotes, bacteria, and archaea.

Journal ArticleDOI
TL;DR: This work demonstrates the general point that DNA microarrays that sequence important genomic regions (such as drug resistance or pathogenicity islands) can simultaneously identify species and provide some insight into the organism's population structure.
Abstract: High-density oligonucleotide arrays can be used to rapidly examine large amounts of DNA sequence in a high throughput manner. An array designed to determine the specific nucleotide sequence of 705 bp of the rpoB gene of Mycobacterium tuberculosis accurately detected rifampin resistance associated with mutations of 44 clinical isolates of M. tuberculosis. The nucleotide sequence diversity in 121 Mycobacterial isolates (comprised of 10 species) was examined by both conventional dideoxynucleotide sequencing of the rpoB and 16S genes and by analysis of the rpoB oligonucleotide array hybridization patterns. Species identification for each of the isolates was similar irrespective of whether 16S sequence, rpoB sequence, or the pattern of rpoB hybridization was used. However, for several species, the number of alleles in the 16S and rpoB gene sequences provided discordant estimates of the genetic diversity within a species. In addition to confirming the array's intended utility for sequencing the region of M. tuberculosis that confers rifampin resistance, this work demonstrates that this array can identify the species of nontuberculous Mycobacteria. This demonstrates the general point that DNA microarrays that sequence important genomic regions (such as drug resistance or pathogenicity islands) can simultaneously identify species and provide some insight into the organism's population structure.

Journal ArticleDOI
TL;DR: Recruitment of enzymes that catalyze a similar but distinct reaction seems to be a major scenario for the evolution of analogous enzymes, which should be taken into account for functional annotation of genomes.
Abstract: It is known that the same reaction may be catalyzed by structurally unrelated enzymes. We performed a systematic search for such analogous (as opposed to homologous) enzymes by evaluating sequence conservation among enzymes with the same enzyme classification (EC) number using sensitive, iterative sequence database search methods. Enzymes without detectable sequence similarity to each other were found for 105 EC numbers (a total of 243 distinct proteins). In 34 cases, independent evolutionary origin of the suspected analogous enzymes was corroborated by showing that they possess different structural folds. Analogous enzymes were found in each class of enzymes, but their overall distribution on the map of biochemical pathways is patchy, suggesting multiple events of gene transfer and selective loss in evolution, rather than acquisition of entire pathways catalyzed by a set of unrelated enzymes. Recruitment of enzymes that catalyze a similar but distinct reaction seems to be a major scenario for the evolution of analogous enzymes, which should be taken into account for functional annotation of genomes. For many analogous enzymes, the bacterial form of the enzyme is different from the eukaryotic one; such enzymes may be promising targets for the development of new antibacterial drugs.

Journal ArticleDOI
TL;DR: The use of Zoo-FISH to identify regions of chromosomal homology has allowed the transfer of information from map-rich species such as human and mouse to a wide variety of other species, and provided a basis for developing a picture of the ancestral mammalian karyotype.
Abstract: Although gene maps for a variety of evolutionarily diverged mammalian species have expanded rapidly during the past few years, until recently it has been difficult to precisely define chromosomal segments that are homologous between species. A solution to this problem has come from the development of Zoo-FISH, also known as cross-species chromosome painting. The use of Zoo-FISH to identify regions of chromosomal homology has allowed the transfer of information from map-rich species such as human and mouse to a wide variety of other species. From a Zoo-FISH analysis spanning four mammalian orders (Primates, Artiodactyla, Carnivora, and Perissodactyla), and involving eight species (human, pig, cattle, Indian muntjac, cat, American mink, harbor seal, and horse), three distinct classes of synteny conservation have been designated: (1) conservation of whole chromosome synteny, (2) conservation of large chromosomal blocks, and (3) conservation of neighboring segment combinations. This analysis has also made it possible to identify a set of chromosome segments (based on human chromosome equivalents) that probably made up the karyotype of the common ancestor of the four orders. This approach provides a basis for developing a picture of the ancestral mammalian karyotype, but a full understanding will depend on studies encompassing more diverse combinations of mammalian orders.

Journal ArticleDOI
TL;DR: The discovery of A-genome repeats in G. gossypioides adds genome-wide support to a suggestion previously based on evidence from only a single genetic locus that this species may be either the closest living descendant of the New World cotton ancestor, or an adulterated relic of polyploid formation.
Abstract: Polyploid formation has played a major role in the evolution of many plant and animal genomes; however, surprisingly little is known regarding the subsequent evolution of DNA sequences that become newly united in a common nucleus. Of particular interest is the repetitive DNA fraction, which accounts for most nuclear DNA in higher plants and animals and which can be remarkably different, even in closely related taxa. In one recently formed polyploid, cotton (Gossypium barbadense L.; AD genome), 83 non-cross-hybridizing DNA clones contain dispersed repeats that are estimated to comprise about 24% of the nuclear DNA. Among these, 64 (77%) are largely restricted to diploid taxa containing the larger A genome and collectively account for about half of the difference in DNA content between Old World (A) and New World (D) diploid ancestors of cultivated AD tetraploid cotton. In tetraploid cotton, FISH analysis showed that some A-genome dispersed repeats appear to have spread to D-genome chromosomes. Such spread may also account for the finding that one, and only one, D-genome diploid cotton, Gossypium gossypioides, contains moderate levels of (otherwise) A-genome-specific repeats in addition to normal levels of D-genome repeats. The discovery of A-genome repeats in G. gossypioides adds genome-wide support to a suggestion previously based on evidence from only a single genetic locus that this species may be either the closest living descendant of the New World cotton ancestor, or an adulterated relic of polyploid formation. Spread of dispersed repeats in the early stages of polyploid formation may provide a tag to identify diploid progenitors of a polyploid. 
Although most repetitive clones do not correspond to known DNA sequences, 4 correspond to known transposons, most contain internal subrepeats, and at least 12 (including 2 of the possible transposons) hybridize to mRNAs expressed at readily discernible levels in cotton seedlings, implicating transposition as one possible mechanism of spread. Integration of molecular, phylogenetic, and cytogenetic analysis of dispersed repetitive DNA may shed new light on evolution of other polyploid genomes, as well as providing valuable landmarks for many aspects of genome analysis.

Journal ArticleDOI
TL;DR: 10^5 bp of DNA from clones containing human, sheep, and mouse PrP genes isolated in cosmids or lambda phage were sequenced, and sequences in noncoding DNA that are conserved between the three species and may represent biologically functional sites were identified.
Abstract: The prion protein (PrP), first identified in scrapie-infected rodents, is encoded by a single exon of a single-copy chromosomal gene. In addition to the protein-coding exon, PrP genes in mammals contain one or two 5'-noncoding exons. To learn more about the genomic organization of regions surrounding the PrP exons, we sequenced 10^5 bp of DNA from clones containing human, sheep, and mouse PrP genes isolated in cosmids or lambda phage. Our findings are as follows: (1) Although the human PrP transcript does not include the untranslated exon 2 found in its mouse and sheep counterparts, the large intron of the human PrP gene contains an exon 2-like sequence flanked by consensus splice acceptor and donor sites. (2) The mouse Prnp^a but not the Prnp^b allele found in 44 inbred lines contains a 6593 nucleotide retroviral genome inserted into the anticoding strand of intron 2. This intracisternal A-particle element is flanked by duplications of an AAGGCT nucleotide motif. (3) We found that the PrP gene regions contain from 40% to 57% genome-wide repetitive elements that independently increased the size of the locus in all three species by numerous mutations. The unusually long sheep PrP 3'-untranslated region contains a "fossil" 1.2-kb mariner transposable element. (4) We identified sequences in noncoding DNA that are conserved between the three species and may represent biologically functional sites.

Journal ArticleDOI
TL;DR: The development of a homogeneous DNA detection method that requires no further manipulations after the initial reaction is set up and can be automated for high-throughput genotyping in large-scale population studies is described.
Abstract: Single-nucleotide variations are the most widely distributed genetic markers in the human genome. A subset of these variations, the substitution mutations, are responsible for most genetic disorders. As single nucleotide polymorphism (SNP) markers are being developed for molecular diagnosis of genetic disorders and large-scale population studies for genetic analysis of complex traits, a simple, sensitive, and specific test for single nucleotide changes is highly desirable. In this report we describe the development of a homogeneous DNA detection method that requires no further manipulations after the initial reaction is set up. This assay, named dye-labeled oligonucleotide ligation (DOL), combines the PCR and the oligonucleotide ligation reaction in a two-stage thermal cycling sequence with fluorescence resonance energy transfer (FRET) detection monitored in real time. Because FRET occurs only when the donor and acceptor dyes are in close proximity, one can infer the genotype or mutational status of a DNA sample by monitoring the specific ligation of dye-labeled oligonucleotide probes. We have successfully applied the DOL assay to genotype 10 SNPs or mutations. By designing the PCR primers and ligation probes in a consistent manner, multiple assays can be done under the same thermal cycling conditions. The standardized design and execution of the DOL assay means that it can be automated for high-throughput genotyping in large-scale population studies.

Journal ArticleDOI
TL;DR: To accelerate gene discovery and facilitate genetic mapping in the protozoan parasite Toxoplasma gondii, >7000 new ESTs were generated from the 5' ends of randomly selected tachyzoite cDNAs, identifying possible functions for more than 500 new T. gondii genes.
Abstract: To accelerate gene discovery and facilitate genetic mapping in the protozoan parasite Toxoplasma gondii, we have generated >7000 new ESTs from the 5' ends of randomly selected tachyzoite cDNAs. Comparison of the ESTs with the existing gene databases identified possible functions for more than 500 new T. gondii genes by virtue of sequence motifs shared with conserved protein families, including factors involved in transcription, translation, protein secretion, signal transduction, cytoskeleton organization, and metabolism. Despite this success in identifying new genes, more than 50% of the ESTs correspond to genes of unknown function, reflecting the divergent evolutionary status of this parasite. A newly recognized class of genes was identified based on its similarity to sequences known only from other members of the same phylum, therefore identifying sequences that are apparently restricted to the Apicomplexa. Such genes may underlie pathways common to this group of medically important parasites, therefore identifying potential targets for intervention.

Journal ArticleDOI
TL;DR: The dominant white phenotype in domestic pigs is caused by two mutations in the KIT gene encoding the mast/stem cell growth factor receptor (MGF), one gene duplication associated with a partially dominant phenotype and a splice mutation in one of the copies leading to the fully dominant allele.
Abstract: The change of phenotypic traits in domestic animals and crops as a response to selective breeding mimics the much slower evolutionary change in natural populations. Here, we describe that the dominant white phenotype in domestic pigs is caused by two mutations in the KIT gene encoding the mast/stem cell growth factor receptor (MGF), one gene duplication associated with a partially dominant phenotype and a splice mutation in one of the copies leading to the fully dominant allele. The splice mutation is a G to A substitution in the first nucleotide of intron 17 and leads to skipping of exon 17. The duplication is most likely a regulatory mutation affecting KIT expression, whereas the splice mutation is expected to cause a receptor with impaired or absent tyrosine kinase activity. Immunocytochemistry showed that this variant form is expressed in 17- to 19-day-old pig embryos. Hundreds of millions of white pigs around the world are assumed to be heterozygous or homozygous for the two mutations. [The EMBL accession numbers for porcine KIT1*0101, KIT1*0202, KIT2*0202, and KIT2*0101 are AJ223228-AJ223231, respectively.]

Journal ArticleDOI
TL;DR: The results clearly indicate that developing SNP markers from overlapping genomic sequence is highly efficient and cost effective, requiring only the two simple steps of developing STSs around the known SNPs and characterizing them in the appropriate populations.
Abstract: An efficient strategy to develop a dense set of single-nucleotide polymorphism (SNP) markers is to take advantage of the human genome sequencing effort currently under way. Our approach is based on the fact that bacterial artificial chromosomes (BACs) and P1-based artificial chromosomes (PACs) used in long-range sequencing projects come from diploid libraries. If the overlapping clones sequenced are from different lineages, one is comparing the sequences from 2 homologous chromosomes in the overlapping region. We have analyzed in detail every SNP identified while sequencing three sets of overlapping clones found on chromosome 5p15.2, 7q21-7q22, and 13q12-13q13. In the 200.6 kb of DNA sequence analyzed in these overlaps, 153 SNPs were identified. Computer analysis for repetitive elements and suitability for STS development yielded 44 STSs containing 68 SNPs for further study. All 68 SNPs were confirmed to be present in at least one of the three (Caucasian, African-American, Hispanic) populations studied. Furthermore, 42 of the SNPs tested (62%) were informative in at least one population, 32 (47%) were informative in two or more populations, and 23 (34%) were informative in all three populations. These results clearly indicate that developing SNP markers from overlapping genomic sequence is highly efficient and cost effective, requiring only the two simple steps of developing STSs around the known SNPs and characterizing them in the appropriate populations.
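The percentages quoted in the abstract above can be rechecked from the reported counts. The short Python sketch below does only that arithmetic. The variable and function names are illustrative, and the SNP-density figure is derived here from the reported totals rather than stated by the authors.

```python
# Counts as reported in the abstract; names are illustrative only.
SNPS_TESTED = 68
INFORMATIVE = {
    "at least one population": 42,
    "two or more populations": 32,
    "all three populations": 23,
}


def percent(count, total):
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


def bp_per_snp(kb_sequenced, snp_count):
    """Average spacing between SNPs in base pairs (derived, not from the paper)."""
    return kb_sequenced * 1000 / snp_count


if __name__ == "__main__":
    for label, count in INFORMATIVE.items():
        print(f"{label}: {count}/{SNPS_TESTED} = {percent(count, SNPS_TESTED)}%")
    # 153 SNPs in 200.6 kb of overlapping sequence
    print(f"about one SNP per {bp_per_snp(200.6, 153):.0f} bp")
```

The rounded values agree with the 62%, 47%, and 34% given in the abstract.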

Journal ArticleDOI
TL;DR: This comprehensive map of goat chromosomes will speed up positional cloning projects in domestic ruminants and clarify some aspects of mammalian chromosomal evolution.
Abstract: A total of 202 genes were cytogenetically mapped to goat chromosomes, multiplying by five the total number of regional gene localizations in domestic ruminants (255). This map encompasses 249 and 173 common anchor loci regularly spaced along human and murine chromosomes, respectively, which makes it possible to perform a genome-wide comparison between three mammalian orders. Twice as many rearrangements as revealed by ZOO-FISH were observed. The average size of conserved fragments could be estimated at 27 and 8 cM with humans and mice, respectively. The positions of evolutionary breakpoints often correspond with human chromosome sites known to be vulnerable to rearrangement in neoplasia. Furthermore, 75 microsatellite markers, 30 of which were isolated from gene-containing bacterial artificial chromosomes (BACs), were added to the previous goat genetic map, achieving 88% genome coverage. Finally, 124 microsatellites were cytogenetically mapped, which made it possible to physically anchor and orient all the linkage groups. We believe that this comprehensive map will speed up positional cloning projects in domestic ruminants and clarify some aspects of mammalian chromosomal evolution. [The sequence data described in this paper have been submitted to the GenBank data library under accession nos. G40978‐G41020, AF083170‐AF083184, AF088286, AF08287, AF083401‐AF083406, AF082884, and AF082885.]

Journal ArticleDOI
TL;DR: Phylogenetic analyses of the str and stl families, and comparisons with a few orthologs in Caenorhabditis briggsae, reveal ongoing processes of gene duplication, diversification, and movement.
Abstract: The str family of genes encoding seven-transmembrane G-protein-coupled or serpentine receptors related to the ODR-10 diacetyl chemoreceptor is very large, with at least 197 members in the Caenorhabditis elegans genome. The closely related stl family has 43 genes, and both families are distantly related to the srd family with 55 genes. Analysis of the structures of these genes indicates that a third of them are clearly or likely pseudogenes. Preliminary surveys of other candidate chemoreceptor families indicate that as many as 800 genes and pseudogenes or 6% of the genome might encode 550 functional chemoreceptors constituting 4% of the C. elegans protein complement. Phylogenetic analyses of the str and stl families, and comparisons with a few orthologs in Caenorhabditis briggsae, reveal ongoing processes of gene duplication, diversification, and movement. The reconstructed ancestral gene structures for these two families have eight introns each, four of which are homologous. Mapping of intron distributions on the phylogenetic tree reveals that each intron has been lost many times independently. Most of these introns were lost individually, which might best be explained by precise in-frame deletions involving nonhomologous recombination between short direct repeats at their termini. [Alignment of the putatively functional proteins in the str and stl families is available from Pfam (http://genome.wustl.edu/Pfam); alignments of all translations are available at http://cshl.org/gr; alignments of the genes are available from the author at hughrobe@uiuc.edu]

Journal ArticleDOI
TL;DR: This map provides a foundation for the study of the possible roles of ribosomal protein deficiencies in chromosomal and Mendelian disorders.
Abstract: We mapped 75 genes that collectively encode >90% of the proteins found in human ribosomes. Because localization of ribosomal protein genes (rp genes) is complicated by the existence of processed pseudogenes, multiple strategies were devised to identify PCR-detectable sequence-tagged sites (STSs) at introns. In some cases we exploited specific, pre-existing information about the intron/exon structure of a given human rp gene or its homolog in another vertebrate. When such information was unavailable, selection of PCR primer pairs was guided by general insights gleaned from analysis of all mammalian rp genes whose intron/exon structures have been published. For many genes, PCR amplification of introns was facilitated by use of YAC pool DNAs rather than total human genomic DNA as templates. We then assigned the rp gene STSs to individual human chromosomes by typing human‐rodent hybrid cell lines. The genes were placed more precisely on the physical map of the human genome by typing of radiation hybrids or screening YAC libraries. Fifty-one previously unmapped rp genes were localized, and 24 previously reported rp gene localizations were confirmed, refined, or corrected. Though functionally related and coordinately expressed, the 75 mapped genes are widely dispersed: Both sex chromosomes and at least 20 of the 22 autosomes carry one or more rp genes. Chromosome 19, known to have a high gene density, contains an unusually large number of rp genes (12). This map provides a foundation for the study of the possible roles of ribosomal protein deficiencies in chromosomal and Mendelian disorders. [The sequence data described in this paper have been submitted to GenBank. They are listed in Table 1.]

Journal ArticleDOI
TL;DR: Using the recently introduced BigDye terminators, large-template DNA can be directly sequenced with custom primers on automated instruments without additional manipulations of template DNA, thereby bypassing tedious subcloning steps.
Abstract: In microbial genome or large-insert clone sequencing projects that use the predominant random subclone sequencing strategy, progress tends to decrease dramatically at late stages as one confronts gaps. At these points, DNA is under-represented or unstable in subclones (E.Y. Chen et al. 1996; Chissoe et al. 1997). Further sequencing with additional random subclones is then inefficient at best, and one must frequently employ alternative cloning systems or additional methods like long-range PCR to recover missing DNA (C.N. Chen et al. 1996). The variability of performance of these methods and the necessity for custom-tailored work tend to hamper the late stages of sequencing efforts. In contrast, if one can sequence directly from genomic DNA (or large-insert clones such as BACs or PACs) with walking primers, cumbersome work to fill gaps could be completed in a much shorter time. As an example, in a recent project to sequence the 750-kb genome of Ureaplasma urealyticum (J. Glass, in prep.), assemblage of ∼13,000 sequence reads and combinatorial PCR reactions to join contigs left two gaps. No λ, pUC, or M13 subclones were recovered that spanned the gaps, nor were PCR products derived with any of several sets of flanking primers. The difficulty of cloning these segments is probably attributable to repeated sequences in and near the two gaps, but the high sensitivity of the recently introduced BigDye terminator (Rosenblum et al. 1997) permitted direct sequencing of the gap regions on genomic U. urealyticum DNA templates. Using the conditions described in this report, two gaps of 259 and 121 bp were sequenced from both strands with walking primers to complete the project of 751,723 bp. Direct sequencing was further tested for larger templates, and good results were reproducibly obtained with 1.2-Mb Mycoplasma fermentans, 2.3-Mb Streptococcus pneumoniae, and 4.6-Mb Escherichia coli genomic DNA (see example in Fig. 1).
In addition, several difficult gaps in sequencing projects with BAC clones, ranging in size from 140 to 250 kb, have also been filled in this manner. Essentially the method is applicable whenever 2–3 μg of high-quality large-template DNA is available.
Figure 1: Sequencing of E. coli K12 strain genomic DNA with BigDye terminators. Approximately 3 μg of E. coli DNA was sequenced with an apaG gene primer (5′-GTTCCCACACTCATTCATTA) using the conditions described in the text.

Journal ArticleDOI
TL;DR: The biosynthesis pathways of all 20 amino acids were completely reconstructed in Escherichia coli, Haemophilus influenzae, and Bacillus subtilis, and probably in Synechocystis and Saccharomyces cerevisiae as well, although it was necessary to assume wider substrate specificity for aspartate aminotransferases.
Abstract: The complete genome sequence of an organism contains information that has not been fully utilized in current methods for predicting gene function, which are based on piece-by-piece similarity searches of individual genes. We present here a method that utilizes higher-level information about molecular pathways to reconstruct a complete functional unit from a set of genes. Specifically, a genome-by-genome comparison is first made for identifying enzyme genes and assigning EC numbers, which is followed by the reconstruction of selected portions of the metabolic pathways by use of the reference biochemical knowledge. The completeness of the reconstructed pathway is an indicator of the correctness of the initial gene function assignment. This feature has become possible because of our efforts to computerize the current knowledge of metabolic pathways under the KEGG project. We found that the biosynthesis pathways of all 20 amino acids were completely reconstructed in Escherichia coli, Haemophilus influenzae, and Bacillus subtilis, and probably in Synechocystis and Saccharomyces cerevisiae as well, although it was necessary to assume wider substrate specificity for aspartate aminotransferases.
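The idea that pathway completeness can serve as a check on gene function assignments can be pictured with a small Python sketch. It is not the KEGG implementation. The function name, the data layout (each pathway step as a set of alternative enzyme labels), and the placeholder labels `E1`, `E2`, `E2b`, `E3` are all assumptions made for illustration and are not real EC numbers.

```python
def pathway_completeness(reference_steps, assigned_enzymes):
    """Score a reference pathway against the enzymes assigned in one genome.

    reference_steps: list of sets; each set holds the alternative enzyme
        labels that can catalyse one step (hypothetical layout).
    assigned_enzymes: set of enzyme labels assigned to the genome's genes.
    Returns (fraction of steps covered, list of uncovered steps).
    """
    # A step counts as covered if any of its alternatives was assigned.
    missing = [sorted(step) for step in reference_steps
               if not (step & assigned_enzymes)]
    covered = len(reference_steps) - len(missing)
    return covered / len(reference_steps), missing


if __name__ == "__main__":
    # Toy pathway with three steps; labels are placeholders, not EC numbers.
    toy_pathway = [{"E1"}, {"E2", "E2b"}, {"E3"}]
    toy_genome = {"E1", "E2b"}
    fraction, gaps = pathway_completeness(toy_pathway, toy_genome)
    print(f"covered {fraction:.0%} of steps; missing: {gaps}")
```

In this toy setup an uncovered step marks either a gene that is really absent or an assignment that was missed, which is the sense in which the abstract treats completeness as evidence about the initial annotation.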

Journal ArticleDOI
TL;DR: Analysis of sequence substitutions and evolutionary distances in this data set revealed that most C. elegans genes are evolving more rapidly than Drosophila genes, suggesting that unequal evolutionary rates may contribute to the differences in similarity to human protein sequences.
Abstract: Comparisons of DNA and protein sequences between humans and model organisms, including the yeast Saccharomyces cerevisiae, the nematode Caenorhabditis elegans, and the fruit fly Drosophila melanogaster, are a significant source of information about the function of human genes and proteins in both normal and disease states. Important questions regarding cross-species sequence comparison remain unanswered, including (1) the fraction of the metabolic, signaling, and regulatory pathways that is shared by humans and the various model organisms; and (2) the validity of functional inferences based on sequence homology. We addressed these questions by analyzing the available fractions of human, fly, nematode, and yeast genomes for orthologous protein-coding genes, applying strict criteria to distinguish between candidate orthologous and paralogous proteins. Forty-two quartets of proteins could be identified as candidate orthologs. Twenty-four Drosophila protein sequences were more similar to their human orthologs than the corresponding nematode proteins. Analysis of sequence substitutions and evolutionary distances in this data set revealed that most C. elegans genes are evolving more rapidly than Drosophila genes, suggesting that unequal evolutionary rates may contribute to the differences in similarity to human protein sequences. The available fraction of Drosophila proteins appears to lack representatives of many protein families and domains, reflecting the relative paucity of genomic data from this species.

Journal ArticleDOI
TL;DR: An F2 population derived from Dahl salt-sensitive and Lewis rats was raised on an 8% NaCl diet for 9 weeks and analyzed for blood pressure quantitative trait loci (QTL) by use of a whole genome scan, proving the existence of QTL on these chromosomes.
Abstract: An F2 population (n = 151) derived from Dahl salt-sensitive (S) and Lewis rats was raised on an 8% NaCl diet for 9 weeks and analyzed for blood pressure quantitative trait loci (QTL) by use of a whole genome scan. Chromosomes 5 and 10 yielded lod scores for linkage to blood pressure that were significant; chromosomes 1, 2, 3, 8, 16, 17, and 18 gave lod scores suggestive for linkage. Chromosome 7 gave a significant signal for heart weight with a lesser effect on blood pressure. Congenic strains were constructed by introgressing Lewis low-blood-pressure QTL alleles for chromosomes 1, 5, 10, and 17 into the S genetic background. Congenic strains for chromosomes 1, 5, and 10 had significantly lower blood pressure than S, proving the existence of QTL on these chromosomes, but the chromosome 17 congenic strain failed to trap a contrasting QTL allele. The QTL allele increasing blood pressure originated from S rats for all QTL except those on chromosomes 2 and 7 in which the Lewis allele increased blood pressure. Interactions between each QTL and every other locus in the genome scan yielded significant interactions between chromosomes 10 and 4, and between chromosomes 2 and 3.