Showing papers in "Genome Research in 2001"


Journal ArticleDOI
TL;DR: A tool that uses sequence homology to predict whether a substitution affects protein function is constructed, which may be used to identify plausible disease candidates among the SNPs that cause missense substitutions.
Abstract: Many missense substitutions are identified in single nucleotide polymorphism (SNP) data and large-scale random mutagenesis projects. Each amino acid substitution potentially affects protein function. We have constructed a tool that uses sequence homology to predict whether a substitution affects protein function. SIFT, which sorts intolerant from tolerant substitutions, classifies substitutions as tolerated or deleterious. A higher proportion of substitutions predicted to be deleterious by SIFT gives an affected phenotype than substitutions predicted to be deleterious by substitution scoring matrices in three test cases. Using SIFT before mutagenesis studies could reduce the number of functional assays required and yield a higher proportion of affected phenotypes. SIFT may be used to identify plausible disease candidates among the SNPs that cause missense substitutions.

2,374 citations


Journal ArticleDOI
TL;DR: The current knowledge of the human ABC genes, their role in inherited disease, and understanding of the topology of these genes within the membrane are reviewed.
Abstract: The ATP-binding cassette (ABC) transporter superfamily contains membrane proteins that translocate a variety of substrates across extra- and intra-cellular membranes. Genetic variation in these genes is the cause of or contributor to a wide variety of human disorders with Mendelian and complex inheritance, including cystic fibrosis, neurological disease, retinal degeneration, cholesterol and bile transport defects, anemia, and drug response. Conservation of the ATP-binding domains of these genes has allowed the identification of new members of the superfamily based on nucleotide and protein sequence homology. Phylogenetic analysis is used to divide all 48 known ABC transporters into seven distinct subfamilies of proteins. For each gene, the precise map location on human chromosomes, expression data, and localization within the superfamily has been determined. These data allow predictions to be made as to potential functions or disease phenotypes associated with each protein. In this paper, we review the current state of knowledge on all human ABC genes in inherited disease and drug resistance. In addition, the availability of the complete Drosophila genome sequence allows the comparison of the known human ABC genes with those in the fly genome. The combined data enable an evolutionary analysis of the superfamily. Complete characterization of all ABC from the human genome and from model organisms will lead to important insights into the physiology and the molecular basis of many human disorders.

1,751 citations


Journal ArticleDOI
TL;DR: A set of 200 Class I SSR markers was developed and integrated into the existing microsatellite map of rice, providing immediate links between the genetic, physical, and sequence-based maps.
Abstract: A total of 57.8 Mb of publicly available rice (Oryza sativa L.) DNA sequence was searched to determine the frequency and distribution of different simple sequence repeats (SSRs) in the genome. SSR loci were categorized into two groups based on the length of the repeat motif. Class I, or hypervariable markers, consisted of SSRs ≥20 bp, and Class II, or potentially variable markers, consisted of SSRs ≥12 bp and <20 bp. The occurrence of Class I SSRs in end-sequences of EcoRI- and HindIII-digested BAC clones was one SSR per 40 kb, whereas in continuous genomic sequence (represented by 27 fully sequenced BAC and PAC clones), the frequency was one SSR every 16 kb. Class II SSRs were estimated to occur every 3.7 kb in BAC ends and every 1.9 kb in fully sequenced BAC and PAC clones. GC-rich trinucleotide repeats (TNRs) were most abundant in protein-coding portions of ESTs and in fully sequenced BACs and PACs, whereas AT-rich TNRs showed no such preference, and di- and tetranucleotide repeats were most frequently found in noncoding, intergenic regions of the rice genome. Microsatellites with poly(AT)n repeats represented the most abundant and polymorphic class of SSRs but were frequently associated with the Micropon family of miniature inverted-repeat transposable elements (MITEs) and were difficult to amplify. A set of 200 Class I SSR markers was developed and integrated into the existing microsatellite map of rice, providing immediate links between the genetic, physical, and sequence-based maps. This contribution brings the number of microsatellite markers that have been rigorously evaluated for amplification, map position, and allelic diversity in Oryza spp. to a total of 500.
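The Class I / Class II length thresholds described in this abstract can be illustrated with a short sketch. This is not the authors' search pipeline: the regular expression, the restriction to motifs of 2 to 4 bases, the exclusion of single-base runs, and the function names `classify_ssr` and `find_ssrs` are all assumptions made for illustration. Only whole motif copies are counted toward the repeat length, which is a simplification.

```python
import re

# Assumption: an SSR is a 2-4 base motif repeated in tandem at least twice.
# The lazy quantifier makes the regex prefer the shortest repeating motif.
SSR_PATTERN = re.compile(r"([ACGT]{2,4}?)\1+")


def classify_ssr(length_bp):
    """Apply the length thresholds quoted in the abstract."""
    if length_bp >= 20:
        return "Class I"   # hypervariable: >= 20 bp
    if length_bp >= 12:
        return "Class II"  # potentially variable: >= 12 bp and < 20 bp
    return None            # too short to be counted


def find_ssrs(seq):
    """Return (start, motif, length_bp, class) for each qualifying SSR."""
    results = []
    for match in SSR_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip single-base runs such as AAAA (assumption)
        length_bp = len(match.group(0))
        ssr_class = classify_ssr(length_bp)
        if ssr_class is not None:
            results.append((match.start(), motif, length_bp, ssr_class))
    return results


if __name__ == "__main__":
    demo = "TT" + "GA" * 10 + "TTAT" + "CGG" * 4 + "A"
    for hit in find_ssrs(demo):
        print(hit)
```

On the toy sequence, the 20-bp (GA)n run falls in Class I and the 12-bp (CGG)n run falls in Class II.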

1,495 citations


Journal ArticleDOI
TL;DR: A simple method of using rolling circle amplification to amplify vector DNA such as M13 or plasmid DNA from single colonies or plaques is described, which removes the need for lengthy growth periods and traditional DNA isolation methods.
Abstract: We describe a simple method of using rolling circle amplification to amplify vector DNA such as M13 or plasmid DNA from single colonies or plaques. Using random primers and phi29 DNA polymerase, circular DNA templates can be amplified 10,000-fold in a few hours. This procedure removes the need for lengthy growth periods and traditional DNA isolation methods. Reaction products can be used directly for DNA sequencing after phosphatase treatment to inactivate unincorporated nucleotides. Amplified products can also be used for in vitro cloning, library construction, and other molecular biology applications.

1,107 citations


Journal ArticleDOI
TL;DR: Genomic sequence revealed new possibilities for fermentation pathways and for aerobic respiration and indicated a horizontal transfer of genetic information from Lactococcus to gram-negative enteric bacteria of Salmonella-Escherichia group.
Abstract: Lactococcus lactis is a nonpathogenic AT-rich gram-positive bacterium closely related to the genus Streptococcus and is the most commonly used cheese starter. It is also the best-characterized lactic acid bacterium. We sequenced the genome of the laboratory strain IL1403, using a novel two-step strategy that comprises diagnostic sequencing of the entire genome and a shotgun polishing step. The genome contains 2,365,589 base pairs and encodes 2310 proteins, including 293 protein-coding genes belonging to six prophages and 43 insertion sequence (IS) elements. Nonrandom distribution of IS elements indicates that the chromosome of the sequenced strain may be a product of recent recombination between two closely related genomes. A complete set of late competence genes is present, indicating the ability of L. lactis to undergo DNA transformation. Genomic sequence revealed new possibilities for fermentation pathways and for aerobic respiration. It also indicated a horizontal transfer of genetic information from Lactococcus to gram-negative enteric bacteria of Salmonella-Escherichia group. [The sequence data described in this paper has been submitted to the GenBank data library under accession no. AE005176.]

1,096 citations


Journal ArticleDOI
TL;DR: The results of computational experiments are presented which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods.
Abstract: We describe an algorithm, SSAHA (Sequence Search and Alignment by Hashing Algorithm), for performing fast searches on databases containing multiple gigabases of DNA. Sequences in the database are preprocessed by breaking them into consecutive k-tuples of k contiguous bases and then using a hash table to store the position of each occurrence of each k-tuple. Searching for a query sequence in the database is done by obtaining from the hash table the "hits" for each k-tuple in the query sequence and then performing a sort on the results. We discuss the effect of the tuple length k on the search speed, memory usage, and sensitivity of the algorithm and present the results of computational experiments which show that SSAHA can be three to four orders of magnitude faster than BLAST or FASTA, while requiring less memory than suffix tree methods. The SSAHA algorithm is used for high-throughput single nucleotide polymorphism (SNP) detection and very large scale sequence assembly. Also, it provides Web-based sequence search facilities for Ensembl projects.
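The preprocessing and lookup steps described in this abstract can be sketched in a few lines. This is a toy illustration, not the SSAHA implementation: the function names `build_index` and `search` are hypothetical, and two details are assumptions rather than statements from the abstract, namely that database k-tuples are taken without overlap (offsets 0, k, 2k, ...) while every overlapping k-tuple of the query is looked up, and that hits are sorted by sequence and by the shift between database and query positions.

```python
from collections import defaultdict


def build_index(sequences, k):
    """Hash table from each k-tuple to the (sequence index, offset) pairs
    where it occurs. Assumption: database k-tuples do not overlap."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for offset in range(0, len(seq) - k + 1, k):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def search(index, query, k):
    """Collect hits for every k-tuple in the query, then sort them.

    Each hit is (sequence index, shift, offset), where
    shift = offset - query position. Hits that share a sequence index and
    a shift lie on the same diagonal, so sorting groups candidate matches.
    """
    hits = []
    for q_pos in range(len(query) - k + 1):
        for seq_id, offset in index.get(query[q_pos:q_pos + k], ()):
            hits.append((seq_id, offset - q_pos, offset))
    hits.sort()
    return hits


if __name__ == "__main__":
    database = ["ACGTACGTTT", "GGGACGTACG"]
    idx = build_index(database, k=4)
    for hit in search(idx, "ACGTACGT", k=4):
        print(hit)
```

In this sketch a larger k means fewer chance hits and faster lookups but a bigger table of possible keys and less tolerance for mismatches, which is the kind of trade-off among speed, memory, and sensitivity the abstract refers to.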

1,057 citations


Journal ArticleDOI
TL;DR: Key features regarding different aspects of pyrosequencing technology are considered, including the general principles, enzyme properties, sequencing modes, instrumentation, and potential applications.
Abstract: DNA sequencing is one of the most important platforms for the study of biological systems today. Sequence determination is most commonly performed using dideoxy chain termination technology. Recently, pyrosequencing has emerged as a new sequencing methodology. This technique is a widely applicable, alternative technology for the detailed characterization of nucleic acids. Pyrosequencing has the potential advantages of accuracy, flexibility, parallel processing, and can be easily automated. Furthermore, the technique dispenses with the need for labeled primers, labeled nucleotides, and gel-electrophoresis. This article considers key features regarding different aspects of pyrosequencing technology, including the general principles, enzyme properties, sequencing modes, instrumentation, and potential applications.

1,045 citations


Journal ArticleDOI
TL;DR: A systematic analysis of 929 human disease gene entries associated with at least one mutant allele in the Online Mendelian Inheritance in Man (OMIM) database against the recently completed genome sequence of Drosophila melanogaster identified 714 distinct human disease genes matching 548 unique Drosophila sequences.
Abstract: We performed a systematic BLAST analysis of 929 human disease gene entries associated with at least one mutant allele in the Online Mendelian Inheritance in Man (OMIM) database against the recently completed genome sequence of Drosophila melanogaster. The results of this search have been formatted as an updateable and searchable on-line database called Homophila. Our analysis identified 714 distinct human disease genes (77% of disease genes searched) matching 548 unique Drosophila sequences, which we have summarized by disease category. This breakdown into disease classes creates a picture of disease genes that are amenable to study using Drosophila as the model organism. Of the 548 Drosophila genes related to human disease genes, 153 are associated with known mutant alleles and 56 more are tagged by P-element insertions in or near the gene. Examples of how to use the database to identify Drosophila genes related to human disease genes are presented. We anticipate that cross-genomic analysis of human disease genes using the power of Drosophila second-site modifier screens will promote interaction between human and Drosophila research groups, accelerating the understanding of the pathogenesis of human genetic disease. The Homophila database is available at http://homophila.sdsc.edu.

937 citations


Journal ArticleDOI
TL;DR: The results of this analysis suggest the following genome expansion history: first, the generation of a "tetrapod-specific" Class II OR cluster on chromosome 11 by local duplication, then a single-step duplication of this cluster to chromosome 1, and finally an avalanche of duplication events out of chromosome 1 to most other chromosomes.
Abstract: Olfactory receptors likely constitute the largest gene superfamily in the vertebrate genome. Here we present the nearly complete human olfactory subgenome elucidated by mining the genome draft with gene discovery algorithms. Over 900 olfactory receptor genes and pseudogenes (ORs) were identified, two-thirds of which were not annotated previously. The number of extrapolated ORs is in good agreement with previous theoretical predictions. The sequence of at least 63% of the ORs is disrupted by what appears to be a random process of pseudogene formation. ORs constitute 17 gene families, 4 of which contain more than 100 members each. "Fish-like" Class I ORs, previously considered a relic in higher tetrapods, constitute as much as 10% of the human repertoire, all in one large cluster on chromosome 11. Their lower pseudogene fraction suggests a functional significance. ORs are disposed on all human chromosomes except 20 and Y, and nearly 80% are found in clusters of 6-138 genes. A novel comparative cluster analysis was used to trace the evolutionary path that may have led to OR proliferation and diversification throughout the genome. The results of this analysis suggest the following genome expansion history: first, the generation of a "tetrapod-specific" Class II OR cluster on chromosome 11 by local duplication, then a single-step duplication of this cluster to chromosome 1, and finally an avalanche of duplication events out of chromosome 1 to most other chromosomes. The results of the data mining and characterization of ORs can be accessed at the Human Olfactory Receptor Data Exploratorium Web site (http://bioinfo.weizmann.ac.il/HORDE).

659 citations


Journal ArticleDOI
TL;DR: The above data indicate that segmental duplications represent a significant impediment to accurate human genome assembly, requiring the development of specialized techniques to finish these exceptional regions of the genome.
Abstract: A main goal of the Human Genome Project (HGP) is to provide the complete and accurate reference sequence of the euchromatic portions of all human chromosomes (Collins et al. 1998). It has been argued that this endeavor differs from previously sequenced invertebrate models not only in terms of scale but also in terms of repetitive complexity (Green 1997; Eichler 1998). Repetitive complexity leads to misassignment and misassembly of sequence. It has been suggested that segmental duplications may be particularly problematic in this regard because of their inconspicuousness, large size, and high degree of sequence similarity. The inability to identify such duplications, let alone differentiate their true position from paralogous positions, may confound sequence assembly, resulting in merging of distinct loci into the same sequence (Eichler 1998). Segmental duplications are duplicated blocks of genomic DNA typically ranging in size from 1–200 kb (IHGSC 2001). They often contain sequence features such as high-copy repeats and gene sequences with intron–exon structure. Thus, being composed of apparently normal genomic DNA, segmental duplications cannot be detected a priori; rather, most segmental duplications have to date been discovered based on experimental analyses. Over the past decade a large number of both intra- and interchromosomal segmental duplications have been observed (Wong et al. 1990; Tomlinson et al. 1994; Eichler et al. 1997; Mazzarella and Schlessinger 1997; Regnier et al. 1997; Zimonjic et al. 1997; Eichler 1998; Trask et al. 1998a; Jackson et al. 1999; Ji et al. 1999). These data suggest numerous interchromosomal exchanges during recent hominoid evolution with apparent biases into and between pericentromeric and subtelomeric regions (Eichler et al. 1997, 1999; Monfouilloux et al. 1998; Trask et al. 1998a; Jackson et al. 1999; Horvath et al. 2000a). 
To date, however, no systematic analysis of the genome has been performed to quantify this bias. Another unanticipated finding has been the important role segmental duplications play in disease (for review, see Ji et al. 2000; Mazzarella and Schlessinger 1998). Aberrant homologous recombination between highly similar paralogs appears to be a major mechanism for many genomic disorders such as velocardiofacial/DiGeorge, Smith-Magenis, and Prader-Willi/Angelman syndromes (Chen et al. 1997; Amos-Landgraf et al. 1999; Christian et al. 1999; Edelmann et al. 1999; Shaikh et al. 2000). A major step toward developing a final reference sequence has been the completion of the draft-sequencing phase of the HGP and its subsequent assembly. The assembly has occurred in three main steps: (1) Sequenced clones are placed into fingerprint contigs generated from the entire RPCI-11 BAC library; (2) fingerprint contigs are assigned and positioned to chromosomes using all available genetic and STS markers; and (3) the sequence within each contig is assembled by Jim Kent's Gigassembler (IHGMC 2001; IHGSC 2001). This landmark achievement has given us the ability to examine segmental duplications in a genome-wide and systematic manner. We reported an unprecedented amount (3.6%) of sequence was involved in recent segmental duplications with identity between 90%–98%. Additionally we provided examples of pericentromeric and subtelomeric regions that appear to be composed almost entirely of duplicated sequence (IHGSC 2001). However, further characterization of highly duplicated regions has yet to be accomplished. In this article, we present our methodology for the analysis of such duplications and an in-depth analysis of segmental duplications in the current working draft assembly (January 2001, oo23 assembly), paying particular attention to the quality of assignment and assembly for the duplication-rich clones and regions. 
Because of the estimated error rates of sequence and the potential for misassembly in the draft assembly, we consider two categories of duplications: segments with >98% nucleotide identity, and segments with 90%–98% identity. For the first time, we quantify the genome-wide enrichment of duplicated sequence in both pericentromeric and subtelomeric regions. In addition, we examine more specifically the impact of these segments on the current assembly. We find duplicated sequences are enriched in sequence contigs that have not been mapped within the current assembly. We also find that clones containing duplications are often assigned to a chromosome inconsistent with FISH and only ∼50% of the chromosomes with FISH signals from these clones have a corresponding sequence similarity by BLAST analysis. This underrepresentation may be attributable to many factors: misassignment, merging, or reduced coverage in these paralogous regions. Taken together, the clustering of duplications combined with the difficulty in positioning and assembling them, suggests that large tracts of segmental duplications, particularly those located at pericentromeres, will be refractory to currently employed assembly methods. Specialized methods will be necessary to correctly integrate these regions into the reference human genome sequence. We propose that the determination of whether an observed overlap is allelic or paralogous will facilitate the final assembly of the human genome, helping to eliminate many gaps both within paralogous as well as unique sequence regions.

657 citations


Journal ArticleDOI
TL;DR: Data is provided on phenotypic analysis of six strains of E. coli and it is shown that PM technology can detect expected phenotypes as well as, in some cases, unexpected phenotypes.
Abstract: Technologies that can provide a cell-wide perspective are very useful. Important “global” technologies using two-dimensional methods were pioneered by O'Farrell and coworkers (O'Farrell 1975) for protein analysis (proteomics), and by Fodor and coworkers (Fodor et al. 1993) for nucleic acid analysis (genomics). These technologies and subsequent refinements allow for global analysis of the important macromolecules of cells that convey the information flow from DNA to RNA to protein. However, the information initially encoded in the genome is ultimately displayed at the cellular level as cellular traits or phenotypes. This paper describes a new technology called Phenotype MicroArrays (PM) that provides an analogous two-dimensional array technology for analysis of live cells (phenomics) to measure hundreds or thousands of cellular properties simultaneously. A technology for global analysis of cellular phenotypes was first proposed by Bochner (1989b) using microplates for high-throughput assays. Two groups working in genomics of Saccharomyces cerevisiae as a model system recently have tested a large number of strains against 96 (Ross-Macdonald et al. 1999) or 288 (Reiger et al. 1997, 1999) growth phenotypes. These groups used microplate technology to test the growth of yeast strains on the surface of agar. A problem with this approach is that it is difficult to scale it efficiently for high-throughput testing. Agar media have a short shelf life and must be prepared freshly. Scoring of growth is rather subjectively and inefficiently performed with daily visual or photographic records. Estimation of apparent cell mass on a surface is difficult and can be misleading. For example, cellular changes can cause colonies to spread or secrete capsular polysaccharide, thereby appearing much larger than the actual cell number. 
An ideal high-throughput system would allow for automated, kinetic reading and storing of quantitative phenotypic data directly into computer databases amenable to bioinformatics analyses. At Biolog, we have employed testing of cellular phenotypes using cell respiration as a reporter system since 1984 (Bochner 1988, 1989a,b). The assay chemistry uses a tetrazolium dye, usually tetrazolium violet, to colorimetrically detect the respiration of cells. Reduction of this dye results in formation of a purple color and, because the dye reduction is essentially irreversible under physiological conditions, it accumulates in the well over a period of hours, amplifying the signal and integrating the amount of respiration over time. This provides several major benefits: (1) The color change is easy to monitor; (2) the color change is easy to quantitate; (3) the color change is very sensitive and highly reproducible; and (4) cell respiration can occur independent of cell growth and, in some cases, can measure phenotypes that do not lead to growth. As part of this technology, the OmniLog instrument has been developed for the purpose of reading and recording the color change in PM assays. The instrument cycles microplates in front of a color CCD camera to read 50 plates in as little as 5 min and provides quantitative and kinetic information about the response of cells in the PMs. Data are stored directly into computer files and can be recalled and compared with other data at any time. Figure 1 shows how cell respiration can be coupled to a large number and a wide range of cellular phenotypes. In a normal growth situation a coordinated sequence of events must occur. Cells must transport nutrients, catabolize and reform them, produce essential small molecule components, polymerize these into macromolecules, create and assemble subcellular structures, etc. 
If all of these processes are working normally, the cell can grow and there will be an actual physical flow of electrons from the carbon source to NADH, down the electron transport chain of the cell, and, ultimately, onto the tetrazolium dye to produce the purple color. If one of these processes is working at a subnormal rate it will become a pinchpoint, restricting this flow and resulting in a decrease in purple color. The severity of the restriction is reflected in the degree of loss of purple color. Total loss of function will result in no growth and no purple color. Therefore, colorimetric assay of respiration can provide a virtually universal reporter system for phenomic testing. [Figure 1: Respiration pathways coupled to cell physiology.] About half of the genes from genomic sequencing efforts have no ascribed function, and even genes with ascribed functions are based primarily on DNA sequence homology, with little or no direct experimental data. Large-scale gene knock-out projects for S. cerevisiae (Burns et al. 1994) have been essentially completed and other projects from Bacillus subtilis (Vagner et al. 1998) to E. coli (Link et al. 1997; Datsenko and Wanner 2000; F. Blattner, pers. comm.) to mouse (Zambrowicz et al. 1998) are at various stages. We developed the PM technology anticipating that it can be used to analyze the effects of loss of gene function, providing a direct experimental linkage between genotype and phenotype. The idea is simply to compare isogenic pairs of strains in PMs to directly assay for the cellular effects of loss of gene function.

Journal ArticleDOI
TL;DR: The recently completed Drosophila genome sequence is scanned for G protein-coupled receptors sensitive to bioactive peptides (peptide GPCRs), and 44 genes are described that represent the vast majority, and perhaps all, of the peptide GPCRs encoded in the fly genome.
Abstract: Recent genetic analyses in worms, flies, and mammals illustrate the importance of bioactive peptides in controlling numerous complex behaviors, such as feeding and circadian locomotion. To pursue a comprehensive genetic analysis of bioactive peptide signaling, we have scanned the recently completed Drosophila genome sequence for G protein-coupled receptors sensitive to bioactive peptides (peptide GPCRs). Here we describe 44 genes that represent the vast majority, and perhaps all, of the peptide GPCRs encoded in the fly genome. We also scanned for genes encoding potential ligands and describe 22 bioactive peptide precursors. At least 32 Drosophila peptide receptors appear to have evolved from common ancestors of 15 monophyletic vertebrate GPCR subgroups (e.g., the ancestral gastrin/cholecystokinin receptor). Six pairs of receptors are paralogs, representing recent gene duplications. Together, these findings shed light on the evolutionary history of peptide GPCRs, and they provide a template for physiological and genetic analyses of peptide signaling in Drosophila.

Journal ArticleDOI
TL;DR: The extent to which a protein interaction map generated in one species can be used to predict interactions in another species is investigated.
Abstract: Protein interaction maps have provided insight into the relationships among the predicted proteins of model organisms for which a genome sequence is available. These maps have been useful in generating potential interaction networks, which have confirmed the existence of known complexes and pathways and have suggested the existence of new complexes and/or crosstalk between previously unlinked pathways. However, the generation of such maps is costly and labor intensive. Here, we investigate the extent to which a protein interaction map generated in one species can be used to predict interactions in another species.

Journal ArticleDOI
TL;DR: A sequenced library of randomly sheared genomic DNA from maize demonstrated that the maize genome is composed of diverse sequences that represent numerous families of retrotransposons and indicated that retroelements abundant in the genome are poorly represented in hypomethylated regions.
Abstract: Long terminal repeat (LTR) retrotransposons have been shown to make up much of the maize genome. Although these elements are known to be prevalent in plant genomes of a middle-to-large size, little information is available on the relative proportions composed by specific families of elements in a single genome. We sequenced a library of randomly sheared genomic DNA from maize to characterize this genome. BLAST analysis of these sequences demonstrated that the maize genome is composed of diverse sequences that represent numerous families of retrotransposons. The largest families contain the previously described elements Huck, Ji, and Opie. Approximately 5% of the sequences are predicted to encode proteins. The genomic abundance of 16 families of elements was estimated by hybridization to an array of 10,752 maize bacterial artificial chromosome (BAC) clones. Comparisons of the number of elements present on individual BACs indicated that retrotransposons are in general randomly distributed across the maize genome. A second library was constructed that was selected to contain sequences hypomethylated in the maize genome. Sequence analysis of this library indicated that retroelements abundant in the genome are poorly represented in hypomethylated regions. Fifty-six retroelement sequences corresponding to the integrase and reverse transcriptase domains were isolated from approximately 407,000 maize expressed sequence tags (ESTs). Phylogenetic analysis of these and the genomic retroelement sequences indicated that elements most abundant in the genome are less abundant at the transcript level than are more rare retrotransposons. Additional phylogenies also demonstrated that rice and maize retrotransposon families are frequently more closely related to each other than to families within the same species. 
An analysis of the GC content of the maize genomic library and that of maize ESTs did not support recently published data that the gene space in maize is found within a narrow GC range, but does indicate that genic sequences have a higher GC content than intergenic sequences (52% vs. 47% GC).

Journal ArticleDOI
TL;DR: A software tool to delineate gene structures using genomically aligned EST sequences, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns.
Abstract: With the availability of a nearly complete sequence of the human genome, aligning expressed sequence tags (EST) to the genomic sequence has become a practical and powerful strategy for gene prediction. Elucidating gene structure is a complex problem requiring the identification of splice junctions, gene boundaries, and alternative splicing variants. We have developed a software tool, Transcript Assembly Program (TAP), to delineate gene structures using genomically aligned EST sequences. TAP assembles the joint gene structure of the entire genomic region from individual splice junction pairs, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns. A method called polyadenylation site scan (PASS) has been developed to detect poly-A sites in the genome. TAP uses these predictions to identify gene boundaries by segmenting the joint gene structure at polyadenylated terminal exons. Reconstructing 1007 known transcripts, TAP scored a sensitivity (Sn) of 60% and a specificity (Sp) of 92% at the exon level. The gene boundary identification process was found to be accurate 78% of the time. TAP also reports alternative splicing patterns in EST alignments. An analysis of alternative splicing in 1124 genic regions suggested that more than half of human genes undergo alternative splicing. Surprisingly, we saw an absolute majority of the detected alternative splicing events affect the coding region. Furthermore, the evolutionary conservation of alternative splicing between human and mouse was analyzed using an EST-based approach. (See http://stl.wustl.edu/~zkan/TAP/)

Journal ArticleDOI
TL;DR: Methods for testing associations between estimated haplotype frequencies derived from multilocus genotype data and disease endpoints assuming a simple case/control sampling design are described, which suggest that haplotype information and linkage disequilibrium-induced associations between polymorphic loci that neighbor loci harboring functional sequence variants can be exploited to identify disease-predisposing alleles in large, freely mixing populations via estimated haplotypes.
Abstract: There is growing debate over the utility of multiple locus association analyses in the identification of genomic regions harboring sequence variants that influence common complex traits such as hypertension and diabetes. Much of this debate concerns the manner in which one can use the genotypic information from individuals gathered in simple sampling frameworks, such as the case/control designs, to actually assess the association between alleles in a particular genomic region and a trait. In this paper we describe methods for testing associations between estimated haplotype frequencies derived from multilocus genotype data and disease endpoints assuming a simple case/control sampling design. These proposed methods overcome the lack of phase information usually associated with samples of unrelated individuals and provide a comprehensive way of assessing the relationship between sequence or multiple-site variation and traits and diseases within populations. We applied the proposed methods in a study of the relationship between polymorphisms within the APOE gene region and Alzheimer's disease. Cases and controls for this study were collected from the United States and France. Our results confirm the known association between the APOE locus and Alzheimer's disease, even when the epsilon 4 polymorphism is not contained in the tested haplotypes. This suggests that, in certain situations, haplotype information and linkage disequilibrium-induced associations between polymorphic loci that neighbor loci harboring functional sequence variants can be exploited to identify disease-predisposing alleles in large, freely mixing populations via estimated haplotype frequency methods.

Journal ArticleDOI
TL;DR: A new gene identification algorithm, GenomeScan, is described that combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model; the results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.
Abstract: With the human genome sequence approaching completion, a major challenge is to identify the locations and encoded protein sequences of all human genes. To address this problem we have developed a new gene identification algorithm, GenomeScan, which combines exon–intron and splice signal models with similarity to known protein sequences in an integrated model. Extensive testing shows that GenomeScan can accurately identify the exon–intron structures of genes in finished or draft human genome sequence with a low rate of false-positives. Application of GenomeScan to 2.7 billion bases of human genomic DNA identified at least 20,000–25,000 human genes out of an estimated 30,000–40,000 present in the genome. The results show an accurate and efficient automated approach for identifying genes in higher eukaryotic genomes and provide a first-level annotation of the draft human genome.

Journal ArticleDOI
TL;DR: An extensive search for bHLH sequences in the completely sequenced genomes of Caenorhabditis elegans and Drosophila melanogaster found 35 and 56 different genes, respectively, which may represent the complete set of bHLH of these organisms.
Abstract: The basic Helix-Loop-Helix (bHLH) proteins are transcription factors that play important roles during the development of various metazoans including fly, nematode, and vertebrates. They are also involved in human diseases, particularly in cancerogenesis. We made an extensive search for bHLH sequences in the completely sequenced genomes of Caenorhabditis elegans and of Drosophila melanogaster. We found 35 and 56 different genes, respectively, which may represent the complete set of bHLH of these organisms. A phylogenetic analysis of these genes, together with a large number (>350) of bHLH from other sources, led us to define 44 orthologous families among which 36 include bHLH from animals only, and two have representatives in both yeasts and animals. In addition, we identified two bHLH motifs present only in yeast, and four that are present only in plants; however, the latter number is certainly an underestimate. Most animal families (35/38) comprise fly, nematode, and vertebrate genes, suggesting that their common ancestor, which lived in pre-Cambrian times (600 million years ago) already owned as many as 35 different bHLH genes.

Journal ArticleDOI
TL;DR: The combinatorial partitioning method (CPM) is presented that examines multiple genes, each containing multiple variable loci, to identify partitions of multilocus genotypes that predict interindividual variation in quantitative trait levels and finds that many combinations of loci are involved in sets of genotypic partitions that predict triglyceride variability and that the most predictive sets show nonadditivity.
Abstract: Recent advances in genome research have accelerated the process of locating candidate genes and the variable sites within them and have simplified the task of genotype measurement. The development of statistical and computational strategies to utilize information on hundreds — soon thousands — of variable loci to investigate the relationships between genome variation and phenotypic variation has not kept pace, particularly for quantitative traits that do not follow simple Mendelian patterns of inheritance. We present here the combinatorial partitioning method (CPM) that examines multiple genes, each containing multiple variable loci, to identify partitions of multilocus genotypes that predict interindividual variation in quantitative trait levels. We illustrate this method with an application to plasma triglyceride levels collected on 188 males, ages 20–60 yr, ascertained without regard to health status, from Rochester, Minnesota. Genotype information included measurements at 18 diallelic loci in six coronary heart disease–candidate susceptibility gene regions: APOA1-C3-A4, APOB, APOE, LDLR, LPL, and PON1. To illustrate the CPM, we evaluated all possible partitions of two-locus genotypes into two to nine partitions (∼10^6 evaluations). We found that many combinations of loci are involved in sets of genotypic partitions that predict triglyceride variability and that the most predictive sets show nonadditivity. These results suggest that traditional methods of building multilocus models that rely on statistically significant marginal, single-locus effects, may fail to identify combinations of loci that best predict trait variability. The CPM offers a strategy for exploring the high-dimensional genotype state space so as to predict the quantitative trait variation in the population at large that does not require the conditioning of the analysis on a prespecified genetic model.
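The size of the search space quoted in the abstract can be checked with a short calculation. The sketch below assumes that each diallelic locus has three genotypes (so a pair of loci has nine two-locus genotypes) and that a "partition" is an ordinary set partition of those nine genotypes into k non-empty groups, counted by Stirling numbers of the second kind. Both are assumptions about how the count was made, and the helper name `stirling2` is illustrative.

```python
from math import comb


def stirling2(n, k):
    """Ways to split n labelled items into k non-empty, unlabelled groups."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            # Item i either joins one of j existing groups or starts a new one.
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


n_loci = 18                       # diallelic loci in the study
two_locus_genotypes = 3 * 3       # assumed: three genotypes per locus
locus_pairs = comb(n_loci, 2)     # 153 pairs of loci
partitions_per_pair = sum(stirling2(two_locus_genotypes, k) for k in range(2, 10))
total_evaluations = locus_pairs * partitions_per_pair

print(locus_pairs, partitions_per_pair, total_evaluations)
# prints: 153 21146 3235338
```

Under these assumptions the total is about 3.2 million, which is on the order of the 10^6 evaluations reported.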

Journal ArticleDOI
TL;DR: PCR conditions that permit the use of the TaqMan or 5' nuclease allelic discrimination assay for typing large numbers of individuals with any SNP and computational methods that allow genotypes to be assigned automatically are described.
Abstract: To make large-scale association studies a reality, automated high-throughput methods for genotyping with single-nucleotide polymorphisms (SNPs) are needed. We describe PCR conditions that permit the use of the TaqMan or 5′ nuclease allelic discrimination assay for typing large numbers of individuals with any SNP and computational methods that allow genotypes to be assigned automatically. To demonstrate the utility of these methods, we typed >1600 individuals for a G-to-T transversion that results in a glutamate-to-aspartate substitution at position 298 in the endothelial nitric oxide synthase gene, and a G/C polymorphism (newly identified in our laboratory) in intron 8 of the 11–β hydroxylase gene. The genotyping method is accurate—we estimate an error rate of fewer than 1 in 2000 genotypes, rapid—with five 96-well PCR machines, one fluorescent reader, and no automated pipetting, over one thousand genotypes can be generated by one person in one day, and flexible—a new SNP can be tested for association in less than one week. Indeed, large-scale genotyping has been accomplished for 23 other SNPs in 13 different genes using this method. In addition, we identified three “pseudo-SNPs” (WIAF1161, WIAF2566, and WIAF335) that are probably a result of duplication.

Journal ArticleDOI
TL;DR: It is concluded that Indian castes are most likely to be of proto-Asian origin with West Eurasian admixture resulting in rank-related and sex-specific differences in the genetic affinities of castes to Asians and Europeans.
Abstract: The origins and affinities of the ∼1 billion people living on the subcontinent of India have long been contested. This is owing, in part, to the many different waves of immigrants that have influenced the genetic structure of India. In the most recent of these waves, Indo-European-speaking people from West Eurasia entered India from the Northwest and diffused throughout the subcontinent. They purportedly admixed with or displaced indigenous Dravidic-speaking populations. Subsequently they may have established the Hindu caste system and placed themselves primarily in castes of higher rank. To explore the impact of West Eurasians on contemporary Indian caste populations, we compared mtDNA (400 bp of hypervariable region 1 and 14 restriction site polymorphisms) and Y-chromosome (20 biallelic polymorphisms and 5 short tandem repeats) variation in ∼265 males from eight castes of different rank to ∼750 Africans, Asians, Europeans, and other Indians. For maternally inherited mtDNA, each caste is most similar to Asians. However, 20%–30% of Indian mtDNA haplotypes belong to West Eurasian haplogroups, and the frequency of these haplotypes is proportional to caste rank, the highest frequency of West Eurasian haplotypes being found in the upper castes. In contrast, for paternally inherited Y-chromosome variation each caste is more similar to Europeans than to Asians. Moreover, the affinity to Europeans is proportionate to caste rank, the upper castes being most similar to Europeans, particularly East Europeans. These findings are consistent with greater West Eurasian male admixture with castes of higher rank. Nevertheless, the mitochondrial genome and the Y chromosome each represents only a single haploid locus and is more susceptible to large stochastic variation, bottlenecks, and selective sweeps. 
Thus, to increase the power of our analysis, we assayed 40 independent, biparentally inherited autosomal loci (1 LINE-1 and 39 Alu elements) in all of the caste and continental populations (∼600 individuals). Analysis of these data demonstrated that the upper castes have a higher affinity to Europeans than to Asians, and the upper castes are significantly more similar to Europeans than are the lower castes. Collectively, all five datasets show a trend toward upper castes being more similar to Europeans, whereas lower castes are more similar to Asians. We conclude that Indian castes are most likely to be of proto-Asian origin with West Eurasian admixture resulting in rank-related and sex-specific differences in the genetic affinities of castes to Asians and Europeans.

Journal ArticleDOI
TL;DR: A comparison of the Autofinish-Hybrid method of finishing against a human finisher, made in five different projects with a variety of shotgun depths by finishing each project twice, shows that the hybrid method saves many hours over a human finisher alone, while using roughly the same number and type of reads and closing gaps at roughly the same rate.
Abstract: Currently, the genome sequencing community is producing shotgun sequence data at a very high rate, but finishing (collecting additional directed sequence data to close gaps and improve the quality of the data) is not matching that rate. One reason for the difference is that shotgun sequencing is highly automated but finishing is not: Most finishing decisions, such as which directed reads to obtain and which specialized sequencing techniques to use, are made by people. If finishing rates are to increase to match shotgun sequencing rates, most finishing decisions also must be automated. The Autofinish computer program (which is part of the computer software package) does this by automatically choosing finishing reads. Autofinish is able to suggest most finishing reads required for completion of each sequencing project, greatly reducing the amount of human attention needed. It sometimes completely finishes the project, with no human decisions required. It cannot solve the most complex problems, so we recommend that Autofinish be allowed to suggest reads for the first three rounds of finishing, and if the project still is not finished completely, a human finisher complete the work. We compared this Autofinish-Hybrid method of finishing against a human finisher in five different projects with a variety of shotgun depths by finishing each project twice--once with each method. This comparison shows that the Autofinish-Hybrid method saves many hours over a human finisher alone, while using roughly the same number and type of reads and closing gaps at roughly the same rate. Autofinish currently is in production use at several large sequencing centers. It is designed to be adaptable to the finishing strategy of the lab--it can finish using some or all of the following: resequencing reads, reverses, custom primer walks on either subclone templates or whole clone templates, PCR, or minilibraries. 
Autofinish has been used for finishing cDNA, genomic clones, and whole bacterial genomes (see http://www.phrap.org).

Journal ArticleDOI
TL;DR: In the present studies, proteomic approaches were used to define the extracellular complement of the B. subtilis secretome and show that genome-based predictions reflect approximately 50% of the actual composition of the extracellular proteome.
Abstract: The availability of complete genome sequences has allowed the prediction of all exported proteins of the corresponding organisms with dedicated algorithms. Even though numerous studies report on genome-based predictions of signal peptides and cell retention signals, they lack a proteomic verification. For example, 180 secretory and 114 lipoprotein signal peptides were predicted recently for the Gram-positive eubacterium Bacillus subtilis. In the present studies, proteomic approaches were used to define the extracellular complement of the B. subtilis secretome. Using different growth conditions and a hyper-secreting mutant, approximately 200 extracellular proteins were visualized by two-dimensional (2D) gel electrophoresis, of which 82 were identified by mass spectrometry. These include 41 proteins that have a potential signal peptide with a type I signal peptidase (SPase) cleavage site, and lack a retention signal. Strikingly, the remaining 41 proteins were predicted previously to be cell associated because of the apparent absence of a signal peptide (22), or the presence of specific cell retention signals in addition to an export signal (19). To test the importance of the five type I SPases and the unique lipoprotein-specific SPase of B. subtilis, the extracellular proteome of (multiple) SPase mutants was analyzed. Surprisingly, only the processing of the polytopic membrane protein YfnI was strongly inhibited in SPase I mutants, showing for the first time that a native eubacterial membrane protein is a genuine SPase I substrate. Furthermore, a mutation affecting lipoprotein modification and processing resulted in the shedding of at least 23 (lipo-)proteins into the medium. In conclusion, our observations show that genome-based predictions reflect approximately 50% of the actual composition of the extracellular proteome. 
Major problems are currently encountered with the prediction of extracellular proteins lacking signal peptides (including cytoplasmic proteins) and lipoproteins.

Journal ArticleDOI
TL;DR: The new method is well suited for high-throughput, automated genotyping because it requires only one reaction per SNP, it is performed in a single tube with no post-PCR handling, the same energy-transfer-labeled primers are used for all analyses, and the instrumentation is inexpensive.
Abstract: We have developed a new method for high-throughput genotyping of single nucleotide polymorphisms (SNPs). The technique involves PCR amplification of genomic DNA with two tailed allele-specific primers that introduce priming sites for universal energy-transfer-labeled primers. The output of red and green light is conveniently scored using a fluorescence plate reader. The new method, which was validated on nine model SNPs, is well suited for high-throughput, automated genotyping because it requires only one reaction per SNP, it is performed in a single tube with no post-PCR handling, the same energy-transfer-labeled primers are used for all analyses, and the instrumentation is inexpensive. Possible applications include multiple-candidate gene analysis, genomewide scans, and medical diagnostics.
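The abstract says only that the red and green output is scored on a fluorescence plate reader. One simple way such two-channel readings could be turned into genotype calls is to classify each well by the angle of its (red, green) signal, as sketched below. The function name, the thresholds, the background handling, and the mapping of red to allele 1 and green to allele 2 are all assumptions made for illustration, not details from the paper.

```python
import math


def call_genotype(red, green, background=0.0, min_signal=1.0, homo_angle_deg=30.0):
    """Call a SNP genotype from two fluorescence channels (illustrative only).

    red, green      raw intensities for the two allele-specific signals
    background      value subtracted from both channels (assumed)
    min_signal      wells with weaker total signal are left uncalled (assumed)
    homo_angle_deg  angular width of each homozygote sector (assumed)
    """
    r = max(red - background, 0.0)
    g = max(green - background, 0.0)
    if math.hypot(r, g) < min_signal:
        return "no call"
    # 0 degrees means pure red signal, 90 degrees means pure green signal.
    angle = math.degrees(math.atan2(g, r))
    if angle < homo_angle_deg:
        return "allele 1 homozygote"
    if angle > 90.0 - homo_angle_deg:
        return "allele 2 homozygote"
    return "heterozygote"


# Hypothetical plate readings.
for well in [(10.0, 1.0), (1.0, 10.0), (5.0, 5.0), (0.1, 0.1)]:
    print(well, call_genotype(*well))
```

Using an angle rather than raw intensities makes the call insensitive to overall signal strength, which tends to vary from well to well; in practice the sector boundaries would be set from control samples.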

Journal ArticleDOI
TL;DR: A group of genes whose expression profiles correlated with that of thrombopoietin are identified and found that genes whose expression associated with AML treatment outcome lie in recurrent chromosomal locations.
Abstract: We have developed a statistical regression modeling approach to discover genes that are differentially expressed between two predefined sample groups in DNA microarray experiments. Our model is based on well-defined assumptions, uses rigorous and well-characterized statistical measures, and accounts for the heterogeneity and genomic complexity of the data. In contrast to cluster analysis, which attempts to define groups of genes and/or samples that share common overall expression profiles, our modeling approach uses known sample group membership to focus on expression profiles of individual genes in a sensitive and robust manner. Further, this approach can be used to test statistical hypotheses about gene expression. To demonstrate this methodology, we compared the expression profiles of 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL) samples from a previous study (Golub et al. 1999) and found 141 genes differentially expressed between AML and ALL with a 1% significance at the genomic level. Using this modeling approach to compare different sample groups within the AML samples, we identified a group of genes whose expression profiles correlated with that of thrombopoietin and found that genes whose expression associated with AML treatment outcome lie in recurrent chromosomal locations. Our results are compared with those obtained using t-tests or Wilcoxon rank sum statistics.

Journal ArticleDOI
TL;DR: It appears that even with such a large number of nucleotide characters (11,592), limited taxon sampling can lead to problems associated with extensive evolution on long phyletic branches.
Abstract: We describe the complete sequence of the 16,596-nucleotide mitochondrial genome of the zebrafish (Danio rerio); contained are 13 protein genes, 22 tRNAs, 2 rRNAs, and a noncoding control region. Codon usage in protein genes is generally biased toward the available tRNA species but also reflects strand-specific nucleotide frequencies. For 19 of the 20 amino acids, the most frequently used codon ends in either A or C, with A preferred over C for fourfold degenerate codons (the lone exception was AUG: methionine). We show that rates of sequence evolution vary nearly as much within vertebrate classes as between them, yet nucleotide and amino acid composition show directional evolutionary trends, including marked differences between mammals and all other taxa. Birds showed similar compositional characteristics to the other nonmammalian taxa, indicating that the evolutionary trend in mammals is not solely due to metabolic rate and thermoregulatory factors. Complete mitochondrial genomes provide a large character base for phylogenetic analysis and may provide for robust estimates of phylogeny. Phylogenetic analysis of zebrafish and 35 other taxa based on all protein-coding genes produced trees largely, but not completely, consistent with conventional views of vertebrate evolution. It appears that even with such a large number of nucleotide characters (11,592), limited taxon sampling can lead to problems associated with extensive evolution on long phyletic branches.
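The codon-usage observation (which nucleotide most often occupies the third codon position) rests on a simple tally. The sketch below shows that tally for a single in-frame coding sequence; the function name and the short example sequence are hypothetical, and a real analysis would run over all 13 protein genes with the vertebrate mitochondrial genetic code in mind.

```python
from collections import Counter


def third_position_usage(cds):
    """Tally codons and third-position nucleotides of an in-frame sequence."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3          # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    codon_counts = Counter(codons)
    third_counts = Counter(codon[2] for codon in codons)
    return codon_counts, third_counts


# Hypothetical sequence, for illustration only.
codon_counts, third_counts = third_position_usage("ATGGCAGCCGCA")
print(codon_counts)
print(third_counts)
```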

Journal ArticleDOI
TL;DR: Gene order conservation among prokaryotic genomes was compared to the cooccurrence of genomes in clusters of orthologous genes (COGs) and to the conservation of protein sequences themselves, and the potential of using template-anchored multiple-genome alignments for predicting functions of uncharacterized genes was quantitatively assessed.
Abstract: Gene order in prokaryotes is conserved to a much lesser extent than protein sequences. Only several operons, primarily those that code for physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes. Nevertheless, even the limited conservation of operon organization that is observed can provide valuable evolutionary and functional clues through multiple genome comparisons. A program for constructing gapped local alignments of conserved gene strings in two genomes was developed. The statistical significance of the local alignments was assessed using Monte Carlo simulations. Sets of local alignments were generated for all pairs of completely sequenced bacterial and archaeal genomes, and for each genome a template-anchored multiple alignment was constructed. In most pairwise genome comparisons, <10% of the genes in each genome belonged to conserved gene strings. When closely related pairs of species (i.e., two mycoplasmas) are excluded, the total coverage of genomes by conserved gene strings ranged from <5% for the cyanobacterium Synechocystis sp to 24% for the minimal genome of Mycoplasma genitalium, and 23% in Thermotoga maritima. The coverage of the archaeal genomes was only slightly lower than that of bacterial genomes. The majority of the conserved gene strings are known operons, with the ribosomal superoperon being the top-scoring string in most genome comparisons. However, in some of the bacterial-archaeal pairs, the superoperon is rearranged to the extent that other operons, primarily those subject to horizontal transfer, show the greatest level of conservation, such as the archaeal-type H+-ATPase operon or ABC-type transport cassettes. The level of gene order conservation among prokaryotic genomes was compared to the cooccurrence of genomes in clusters of orthologous genes (COGs) and to the conservation of protein sequences themselves. Only limited correlation was observed between these evolutionary variables. 
Gene order conservation shows a much lower variance than the cooccurrence of genomes in COGs, which indicates that intragenome homogenization via recombination occurs in evolution much faster than intergenome homogenization via horizontal gene transfer and lineage-specific gene loss. The potential of using template-anchored multiple-genome alignments for predicting functions of uncharacterized genes was quantitatively assessed. Functions were predicted or significantly clarified for approximately 90 COGs (approximately 4% of the total of 2414 analyzed COGs). The most significant predictions were obtained for the poorly characterized archaeal genomes; these include a previously uncharacterized restriction-modification system, a nuclease-helicase combination implicated in DNA repair, and the probable archaeal counterpart of the eukaryotic exosome. Multiple genome alignments are a resource for studies on operon rearrangement and disruption, which is central to our understanding of the evolution of prokaryotic genomes. Because of the rapid evolution of the gene order, the potential of genome alignment for prediction of gene functions is limited, but nevertheless, such predictive information significantly complements the results obtained through protein sequence and structure analysis.
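The abstract mentions a program that builds gapped local alignments of conserved gene strings but gives no algorithmic detail. As a rough illustration of the idea, the sketch below applies a standard Smith-Waterman local alignment to two ordered lists of gene labels, treating each gene as one symbol. The scoring values, the function name, and the two example gene orders are assumptions made for this sketch, and the actual program may work quite differently.

```python
def local_align_gene_strings(genome_a, genome_b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman over gene labels; returns (best score, conserved genes)."""
    rows, cols = len(genome_a) + 1, len(genome_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            same = genome_a[i - 1] == genome_b[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell to recover the conserved gene string.
    conserved = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        same = genome_a[i - 1] == genome_b[j - 1]
        step = match if same else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            if same:
                conserved.append(genome_a[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1                      # gene present only in genome_a
        else:
            j -= 1                      # gene present only in genome_b
    conserved.reverse()
    return best, conserved


# Hypothetical gene orders; labels stand for orthologous gene families.
order_a = ["x", "rpsJ", "rplC", "rplD", "y"]
order_b = ["q", "rpsJ", "rplC", "z", "rplD", "w"]
best_score, conserved_string = local_align_gene_strings(order_a, order_b)
print(best_score, conserved_string)
# prints: 5 ['rpsJ', 'rplC', 'rplD']
```

The gap penalty lets the alignment bridge the inserted gene `z`, which is the property that distinguishes a gapped alignment of gene strings from a search for strictly contiguous shared runs. Assessing significance, which the abstract says was done by Monte Carlo simulation, would require comparing such scores against shuffled gene orders and is not shown here.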

Journal ArticleDOI
TL;DR: In this article, a set-association method was developed to blend relevant sources of information such as allelic association and Hardy-Weinberg disequilibrium to detect association to sets of SNP markers in different genes.
Abstract: The search for genes underlying complex traits has been difficult and often disappointing. The main reason for these difficulties is that several genes, each with rather small effect, might be interacting to produce the trait. Therefore, we must search the whole genome for a good chance to find these genes. Doing this with tens of thousands of SNP markers, however, greatly increases the overall probability of false-positive results, and current methods limiting such error probabilities to acceptable levels tend to reduce the power of detecting weak genes. Investigating large numbers of SNPs inevitably introduces errors (e.g., in genotyping), which will distort analysis results. Here we propose a simple strategy that circumvents many of these problems. We develop a set-association method to blend relevant sources of information such as allelic association and Hardy-Weinberg disequilibrium. Information is combined over multiple markers and genes in the genome, quality control is improved by trimming, and an appropriate testing strategy limits the overall false-positive rate. In contrast to other available methods, our method to detect association to sets of SNP markers in different genes in a real data application has shown remarkable success.

Journal ArticleDOI
TL;DR: These studies reveal extensive genetic diversity among C. jejuni strains and pave the way toward identifying correlates of pathogenicity and developing improved epidemiological tools for this problematic pathogen.
Abstract: Campylobacter jejuni is the leading cause of bacterial food-borne diarrhoeal disease throughout the world, and yet is still a poorly understood pathogen. Whole genome microarray comparisons of 11 C. jejuni strains of diverse origin identified genes in up to 30 NCTC 11168 loci ranging from 0.7 to 18.7 kb that are either absent or highly divergent in these isolates. Many of these regions are associated with the biosynthesis of surface structures including flagella, lipo-oligosaccharide, and the newly identified capsule. Other strain-variable genes of known function include those responsible for iron acquisition, DNA restriction/modification, and sialylation. In fact, at least 21% of genes in the sequenced strain appear dispensable as they are absent or highly divergent in one or more of the isolates tested, thus defining 1300 C. jejuni core genes. Such core genes contribute mainly to metabolic, biosynthetic, cellular, and regulatory processes, but many virulence determinants are also conserved. Comparison of the capsule biosynthesis locus revealed conservation of all the genes in this region in strains with the same Penner serotype as strain NCTC 11168. By contrast, between 5 and 17 NCTC 11168 genes in this region are either absent or highly divergent in strains of a different serotype from the sequenced strain, providing further evidence that the capsule accounts for Penner serotype specificity. These studies reveal extensive genetic diversity among C. jejuni strains and pave the way toward identifying correlates of pathogenicity and developing improved epidemiological tools for this problematic pathogen.

Journal ArticleDOI
TL;DR: A computer program, Spidey, is described that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment; it was used to align reference sequences to known genomic sequences and then to the draft human genome.
Abstract: We have developed a computer program that aligns spliced sequences to genomic sequences, using local alignment algorithms and heuristics to put together a global spliced alignment. Spidey can produce reliable alignments quickly, even when confronted with noise from alternative splicing, polymorphisms, sequencing errors, or evolutionary divergence. We show how Spidey was used to align reference sequences to known genomic sequences and then to the draft human genome, to align mRNAs to gene clusters, and to align mouse mRNAs to human genomic sequence. We compared Spidey to two other spliced alignment programs; Spidey generally performed quite well in a very reasonable amount of time.