Showing papers on "GenBank published in 2006"


Journal Article
TL;DR: Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved.
Abstract: The University of California Santa Cruz (UCSC) Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The detailed steps of this process are described. Extensive cross-references from this dataset to other genomic and proteomic data were constructed. For each known gene, a details page is provided containing rich information about the gene, together with extensive links to other relevant genomic, proteomic and pathway data. As of July 2005, the UCSC Known Genes are available for human, mouse and rat genomes. The Known Genes dataset serves as a foundation to support several key programs: the Genome Browser, Proteome Browser, Gene Sorter and Table Browser offered at the UCSC website. All the associated data files and program source code are also available. They can be accessed at http://genome.ucsc.edu. The genomic coverage of UCSC Known Genes, RefSeq, Ensembl Genes, H-Invitational and CCDS is analyzed. Although UCSC Known Genes offers the highest genomic and CDS coverage among major human and mouse gene sets, more detailed analysis suggests all of them could be further improved. Contact: fanhsu@soe.ucsc.edu

507 citations


Journal Article
13 Apr 2006-Botany
TL;DR: Analysis of over 10 000 rbcL sequences from GenBank demonstrates that this locus could serve well as the core region, with sufficient variation to discriminate among species in approximate ranges, and a tiered approach wherein highly variable loci are nested under a core barcoding gene is proposed.
Abstract: DNA barcoding based on the mitochondrial cytochrome c oxidase 1 (cox1) sequence is being employed for diverse groups of animals with demonstrated success in species identification and new species discovery. Applying barcoding systems to land plants will be a more challenging task as plant genome substitution rates are considerably lower than those observed in animal mitochondria, suggesting that a much greater amount of sequence data from multiple loci will be required to barcode plants. In the absence of an obvious well-characterized plant locus that meets all the necessary criteria, a key first step will be identifying candidate regions with the most potential. To meet the challenges with land plants, we are proposing the adoption of a tiered approach wherein highly variable loci are nested under a core barcoding gene. Analysis of over 10 000 rbcL sequences from GenBank demonstrates that this locus could serve well as the core region, with sufficient variation to discriminate among species in approximate...

336 citations


Journal Article
TL;DR: Identification of medically important yeasts by ITS sequencing, especially using the ITS2 region, is reliable and can be used as an accurate alternative to conventional identification methods.
Abstract: Infections caused by yeasts have increased in previous decades due primarily to the increasing population of immunocompromised patients. In addition, infections caused by less common species such as Pichia, Rhodotorula, Trichosporon, and Saccharomyces spp. have been widely reported. This study extensively evaluated the feasibility of sequence analysis of the rRNA gene internal transcribed spacer (ITS) regions for the identification of yeasts of clinical relevance. Both the ITS1 and ITS2 regions of 373 strains (86 species), including 299 reference strains and 74 clinical isolates, were amplified by PCR and sequenced. The sequences were compared to reference data available at the GenBank database by using BLAST (basic local alignment search tool) to determine if species identification was possible by ITS sequencing. Since the GenBank database currently lacks ITS sequence entries for some yeasts, the ITS sequences of type (or reference) strains of 15 species were submitted to GenBank to facilitate identification of these species. Strains producing discrepant identifications between the conventional methods and ITS sequence analysis were further analyzed by sequencing of the D1-D2 domain of the large-subunit rRNA gene for species clarification. The rates of correct identification by ITS1 and ITS2 sequence analysis were 96.8% (361/373) and 99.7% (372/373), respectively. Of the 373 strains tested, only 1 strain (Rhodotorula glutinis BCRC 20576) could not be identified by ITS2 sequence analysis. In conclusion, identification of medically important yeasts by ITS sequencing, especially using the ITS2 region, is reliable and can be used as an accurate alternative to conventional identification methods.

289 citations


Journal Article
TL;DR: Whole-genome amplification of metagenomic DNA from very minute microbial sources, while introducing an amplification bias, will allow access to genomic information that was not previously accessible.
Abstract: Low-biomass samples from nitrate and heavy metal contaminated soils yield DNA amounts that have limited use for direct, native analysis and screening. Multiple displacement amplification (MDA) using phi29 DNA polymerase was used to amplify whole genomes from environmental, contaminated, subsurface sediments. By first amplifying the genomic DNA (gDNA), biodiversity analysis and gDNA library construction of microbes found in contaminated soils were made possible. The MDA method was validated by analyzing amplified genome coverage from approximately five Escherichia coli cells, resulting in 99.2% genome coverage. The method was further validated by confirming overall representative species coverage and also an amplification bias when amplifying from a mix of eight known bacterial strains. We extracted DNA from samples with extremely low cell densities from a U.S. Department of Energy contaminated site. After amplification, small-subunit rRNA analysis revealed relatively even distribution of species across several major phyla. Clone libraries were constructed from the amplified gDNA, and a small subset of clones was used for shotgun sequencing. BLAST analysis of the library clone sequences showed that 64.9% of the sequences had significant similarities to known proteins, and "clusters of orthologous groups" (COG) analysis revealed that more than half of the sequences from each library contained sequence similarity to known proteins. The libraries can be readily screened for native genes or any target of interest. Whole-genome amplification of metagenomic DNA from very minute microbial sources, while introducing an amplification bias, will allow access to genomic information that was not previously accessible. The reported SSU rRNA sequences and library clone end sequences are listed with their respective GenBank accession numbers, DQ 404590 to DQ 404652, DQ 404654 to DQ 404938, and DX 385314 to DX 389173.

234 citations


Journal Article
TL;DR: GATU greatly simplifies the initial stages of genome annotation by using a closely related genome as a reference and significantly reduces the time required for annotation of genes and mature peptides as well as helping to standardize gene names between related organisms.
Abstract: Background: Since DNA sequencing has become easier and cheaper, an increasing number of closely related viral genomes have been sequenced. However, many of these have been deposited in GenBank without annotations, severely limiting their value to researchers. While maintaining comprehensive genomic databases for a set of virus families at the Viral Bioinformatics Resource Center (http://www.biovirus.org) and Viral Bioinformatics – Canada (http://www.virology.ca), we found that researchers were unnecessarily spending time annotating viral genomes that were close relatives of already annotated viruses. We have therefore designed and implemented a novel tool, Genome Annotation Transfer Utility (GATU), to transfer annotations from a previously annotated reference genome to a new target genome, thereby greatly reducing this laborious task.

208 citations


Journal Article
TL;DR: BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining.
Abstract: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. BioWarehouse embodies significant progress on the database integration problem for bioinformatics.

195 citations


Journal Article
TL;DR: Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera.
Abstract: A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa x 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa x 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. 
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.

168 citations


Journal Article
20 Jun 2006-Virology
TL;DR: Their particle dimensions, host-ranges and buoyant densities were found to be similar but they differed in their stabilities to raised temperature, low salinity and chloroform, suggesting a high frequency of recombination between haloviruses and their hosts.

142 citations


Journal Article
TL;DR: To demonstrate the utility of this EST collection for studying cell wall composition, homologs for the genes involved in the biosynthesis of lignin subunits were identified and reinforced the close relationship of Brachypodium to wheat and barley.
Abstract: Brachypodium distachyon (Brachypodium) is a temperate grass with the physical and genomic attributes necessary for a model system (small size, rapid generation time, self-fertile, small genome size, diploidy in some accessions). To increase the utility of Brachypodium as a model grass, we sequenced 20,440 expressed sequence tags (ESTs) from five cDNA libraries made from leaves, stems plus leaf sheaths, roots, callus and developing seed heads. The ESTs had an average trimmed length of 650 bp. Blast nucleotide alignments against SwissProt and GenBank non-redundant databases were performed and a total of 99.9% of the ESTs were found to have some similarity to existing protein or nucleotide sequences. Tentative functional classification of 77% of the sequences was possible by association with gene ontology or clusters of orthologous group’s index descriptors. To demonstrate the utility of this EST collection for studying cell wall composition, we identified homologs for the genes involved in the biosynthesis of lignin subunits. A subset of the ESTs was used for phylogenetic analysis that reinforced the close relationship of Brachypodium to wheat and barley.

131 citations


Journal Article
TL;DR: The construction of a full-length enriched cDNA library from osmotically stressed maize seedlings by using the modified CAP trapper method is reported and the data suggest that an ethylene signaling pathway may be involved in the maize response to drought stress.
Abstract: Full-length cDNAs are very important for genome annotation and functional analysis of genes. The number of full-length cDNAs from maize (Zea mays L.) remains limited. Here we report the construction of a full-length enriched cDNA library from osmotically stressed maize seedlings by using the modified CAP trapper method. From this library, 2073 full-length cDNAs were collected and further analyzed by sequencing from both the 5'- and 3'-ends. A total of 1728 (83.4%) sequences did not match known maize mRNA and full-length cDNA sequences in the GenBank database and represent new full-length genes. After alignment of the 2073 full-length cDNAs with 448 maize BAC sequences, it was found that 84 full-length cDNAs could be mapped to the BACs. Of these, 43 genes (51.2%) have been correctly annotated from the BAC clones, 37 genes (44.0%) have been annotated with a different exon-intron structure from our cDNA, and four genes (4.76%) had no annotations in the TIGR database. Expression analysis of 2073 full-length maize cDNAs using a cDNA macroarray led to the identification of 79 genes upregulated by stress treatments and 329 downregulated genes. Of the 79 stress-inducible genes, 30 genes contain ABRE, DRE, MYB, MYC core sequences or other abiotic-responsive cis-acting elements in their promoters. These results suggest that these cis-acting elements and the corresponding transcription factors take part in plant responses to osmotic stress either cooperatively or independently. Additionally, the data suggest that an ethylene signaling pathway may be involved in the maize response to drought stress.

103 citations


Journal Article
TL;DR: The observations suggest that evolution of the T4-like phages has drawn on a highly diverged pool of genes in the microbial world and harbour a wealth of genetic material that has not been identified previously.
Abstract: Bacteriophages are an important repository of genetic diversity. As one of the major constituents of terrestrial biomass, they exert profound effects on the earth's ecology and microbial evolution by mediating horizontal gene transfer between bacteria and controlling their growth. Only limited genomic sequence data are currently available for phages but even this reveals an overwhelming diversity in their gene sequences and genomes. The contribution of the T4-like phages to this overall phage diversity is difficult to assess, since only a few examples of complete genome sequence exist for these phages. Our analysis of five T4-like genomes represents half of the known T4-like genomes in GenBank. Here, we have examined in detail the genetic diversity of the genomes of five relatives of bacteriophage T4: the Escherichia coli phages RB43, RB49 and RB69, the Aeromonas salmonicida phage 44RR2.8t (or 44RR) and the Aeromonas hydrophila phage Aeh1. Our data define a core set of conserved genes common to these genomes as well as hundreds of additional open reading frames (ORFs) that are nonconserved. Although some of these ORFs resemble known genes from bacterial hosts or other phages, most show no significant similarity to any known sequence in the databases. The five genomes analyzed here all have similarities in gene regulation to T4. Sequence motifs resembling T4 early and late consensus promoters were observed in all five genomes. In contrast, only two of these genomes, RB69 and 44RR, showed similarities to T4 middle-mode promoter sequences and to the T4 motA gene product required for their recognition. In addition, we observed that each phage differed in the number and assortment of putative genes encoding host-like metabolic enzymes, tRNA species, and homing endonucleases. Our observations suggest that evolution of the T4-like phages has drawn on a highly diverged pool of genes in the microbial world. 
The T4-like phages harbour a wealth of genetic material that has not been identified previously. The mechanisms by which these genes may have arisen may differ from those previously proposed for the evolution of other bacteriophage genomes.

Journal Article
TL;DR: The prevalence study and the characterization of hepatitis C virus (HCV) was carried out in the Philippines and the sequence determination of the 5'-untranslated region (5'-UTR)-Core and the NS5B regions of HCV was carried out in this study.
Abstract: The prevalence study and the characterization of hepatitis C virus (HCV) was carried out in the Philippines and the sequence determination of the 5'-untranslated region (5'-UTR)-Core and the NS5B regions of HCV was carried out in this study. An HCV strain (SE-03-07-1689) collected in Metro Manila, Philippines, belonged to discordant subtypes, 2b and 1b in 5'-UTR-Core and NS5B regions, respectively. The 9.3 kb sequence of this strain including the entire open reading frame was compared with those of the reference strains retrieved from the HCV sequences database (GenBank/EMBL/DDBJ) and indicated a recombination event. The computation of the sequence similarity mapped a crossover point within the NS3 region. This is the second report on the inter-genotype recombinant of HCV and the third when an intra-genotype recombinant is included. This recombinant strain, SE-03-07-1689, is designated tentatively as RF3_2b/1b according to the suggestions used for the other two HCV recombinants.

Journal Article
TL;DR: A SMART cDNA library from spleen of large yellow croaker stimulated by poly I:C was constructed and genes involved in immune functions, including complement system components, immunoglobulins, antigen processing and presentation proteins, interferon system proteins, cytokines, and some innate defence molecules were identified.

Journal Article
TL;DR: The ChloroplastDB provides unified annotations, gene name search, BLAST and download functions for chloroplast encoded genes and genomic sequences, which greatly facilitates comparative research on sequence evolution including changes in gene content, codon usage, gene structure and post-transcriptional modifications such as RNA editing.
Abstract: The Chloroplast Genome Database (ChloroplastDB) is an interactive, web-based database for fully sequenced plastid genomes, containing genomic, protein, DNA and RNA sequences, gene locations, RNA-editing sites, putative protein families and alignments (http://chloroplast.cbio.psu.edu/). With recent technical advances, the rate of generating new organelle genomes has increased dramatically. However, the established ontology for chloroplast genes and gene features has not been uniformly applied to all chloroplast genomes available in the sequence databases. For example, annotations for some published genome sequences have not evolved with gene naming conventions. ChloroplastDB provides unified annotations, gene name search, BLAST and download functions for chloroplast encoded genes and genomic sequences. A user can retrieve all orthologous sequences with one search regardless of gene names in GenBank. This feature alone greatly facilitates comparative research on sequence evolution including changes in gene content, codon usage, gene structure and post-transcriptional modifications such as RNA editing. Orthologous protein sets are classified by TribeMCL and each set is assigned a standard gene name. Over the next few years, as the number of sequenced chloroplast genomes increases rapidly, the tools available in ChloroplastDB will allow researchers to easily identify and compile target data for comparative analysis of chloroplast genes and genomes.

Journal Article
TL;DR: The computational approach is applied to map putative Quadruplex forming GRSs within the transcribed regions of a large number of alternatively processed human and mouse gene sequences that were obtained as fully annotated entries from GenBank and RefSeq to build the GRSDB database, which provides a unique avenue for studying G-quadruplexes in the context of RNA processing sites.
Abstract: Guanine-rich nucleic acids are known to form highly stable G-quadruplex structures, also known as G-quartets. Recently, there has been a tremendous amount of interest in studying G-quadruplexes owing to the realization of their biological importance. G-rich sequences (GRSs) capable of forming G-quadruplexes are found in the vicinity of polyadenylation regions and are involved in regulating 3' end processing of mammalian pre-mRNAs. G-rich motifs are also known to play an important role in alternative, tissue-specific splicing by interacting with hnRNP H protein subfamily. Whether quadruplex structure directly plays a role in regulating RNA processing events requires further investigation. To date there has not been a comprehensive effort to study G-quadruplexes near RNA processing sites. We have applied a computational approach to map putative Quadruplex forming GRSs within the transcribed regions of a large number of alternatively processed human and mouse gene sequences that were obtained as fully annotated entries from GenBank and RefSeq. We have used the computed data to build the GRSDB database that provides a unique avenue for studying G-quadruplexes in the context of RNA processing sites. GRSDB website offers visual comparison of G-quadruplex distribution patterns among all the alternative RNA products of a gene with the help of dynamic graphics. At present, GRSDB contains data from 1310 human and mouse genes, of which 1188 are alternatively processed. It has a total of 379 223 predicted G-quadruplexes, of which 54 252 are near RNA processing sites. GRSDB is a good resource for researchers interested in investigating the functional relevance of G-quadruplexes, especially in the context of alternative RNA processing. It can be accessed at http://bioinformatics.ramapo.edu/grsdb/.

Journal Article
TL;DR: The study provides a significant number of ESTs for gene discovery and candidate genes for studying host defense in scallops and other molluscs.
Abstract: The bay scallop, Argopecten irradians irradians, introduced from North America, has become one of the most important aquaculture species in China. In an effort to identify scallop genes involved in host defense, a high-quality cDNA library was constructed from whole body tissues of the bay scallop. A total of 5828 successful sequencing reactions yielded 4995 expressed sequence tags (ESTs) longer than 100 bp. Cluster and assembly analyses of the ESTs identified 637 contigs (consisting of 2853 sequences) and 2142 singletons, totaling 2779 unique sequences. Basic Local Alignment Search Tool (BLAST) analysis showed that the majority (73%) of the unique sequences had no significant homology (E-value >= 0.005) to sequences in GenBank. Among the 748 sequences with significant GenBank matches, 160 (21.4%) were for genes related to metabolism, 131 (17.5%) for cell/organism defense, 124 (16.6%) for gene/protein expression, 83 (11.1%) for cell structure/motility, 70 (9.4%) for cell signaling/communication, 17 (2.3%) for cell division, and 163 (21.8%) matched to genes of unknown functions. The list of host-defense genes included many genes with known and important roles in innate defense such as lectins, defensins, proteases, protease inhibitors, heat shock proteins, antioxidants, and Toll-like receptors. The study provides a significant number of ESTs for gene discovery and candidate genes for studying host defense in scallops and other molluscs.

Journal Article
TL;DR: The EMBL Nucleotide Sequence Database at the EMBL European Bioinformatics Institute, UK, offers a comprehensive set of publicly available nucleotide sequence and annotation, freely accessible to all.
Abstract: The EMBL Nucleotide Sequence Database (www.ebi.ac.uk/embl) at the EMBL European Bioinformatics Institute, UK, offers a comprehensive set of publicly available nucleotide sequence and annotation, freely accessible to all. Maintained in collaboration with partners DDBJ and GenBank, coverage includes whole genome sequencing project data, directly submitted sequence, sequence recorded in support of patent applications and much more. The database continues to offer submission tools, data retrieval facilities and user support. In 2005, the volume of data offered has continued to grow exponentially. In addition to the newly presented data, the database encompasses a range of new data types generated by novel technologies, offers enhanced presentation and searchability of the data and has greater integration with other data resources offered at the EBI and elsewhere. In stride with these developing data types, the database has continued to develop submission and retrieval tools to maximise the information content of submitted data and to offer the simplest possible submission routes for data producers. New developments, the submission process, data retrieval and access to support are presented in this paper, along with links to sources of further information.

Proceedings Article
01 Dec 2006
TL;DR: These absent sequences define the maximum set of potentially lethal oligomers and provide a rational basis for choosing artificial DNA sequences for molecular barcodes, show promise for species identification and environmental characterization based on absence, and identify potential targets for therapeutic intervention and suicide markers.
Abstract: We describe a new publicly available algorithm for identifying absent sequences, and demonstrate its use by listing the smallest oligomers not found in the human genome (human "nullomers"), and those not found in any reported genome or GenBank sequence ("primes"). These absent sequences define the maximum set of potentially lethal oligomers. They also provide a rational basis for choosing artificial DNA sequences for molecular barcodes, show promise for species identification and environmental characterization based on absence, and identify potential targets for therapeutic intervention and suicide markers.
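The idea of an "absent sequence" can be illustrated with a brute-force sketch. This is not the algorithm from the paper, which is not described in the abstract; it simply enumerates all DNA k-mers and reports those that never occur in a set of input sequences. The function names `absent_kmers` and `smallest_absent` are hypothetical, the toy input is made up, and only the given strand is scanned (reverse complements are ignored), which is a simplifying assumption.

```python
from itertools import product

ALPHABET = "ACGT"


def absent_kmers(sequences, k):
    """Return the sorted list of DNA k-mers that occur in none of the sequences."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguity codes such as N.
            if set(kmer) <= set(ALPHABET):
                seen.add(kmer)
    all_kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [kmer for kmer in all_kmers if kmer not in seen]


def smallest_absent(sequences, max_k=12):
    """Find the smallest k with at least one absent k-mer, and list those k-mers."""
    for k in range(1, max_k + 1):
        missing = absent_kmers(sequences, k)
        if missing:
            return k, missing
    return None, []


if __name__ == "__main__":
    k, missing = smallest_absent(["ACGTAC"])
    print(k, len(missing))
```

Enumerating all 4^k candidates is only practical for small k; a genome-scale search would need a more compact representation, which this sketch does not attempt.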

Journal Article
TL;DR: The convenience and reliability of this method for the determination of the self-incompatibility (SI) genotype was demonstrated both in sweet cherry cultivars representing alleles S1 to S16 and in individuals from a wild population encompassing S-alleles S17 to S22.
Abstract: This study characterises a series of 12 S-locus haplotype-specific F-box protein genes (SFB) in cherry (Prunus avium) that are likely candidates for the pollen component of gametophytic self-incompatibility in this species. Primers were designed to amplify 12 SFB alleles, including the introns present in the 5′ untranslated region; sequences representing the S-alleles S1, S2, S3, S4, S4′, S5, S6, S7, S10, S12, S13 and S16 were cloned and characterized. [The nucleotide sequences reported in this paper have been submitted to the EMBL/GenBank database under the following accession numbers: PaSFB1 (AY805048), PaSFB2 (AY805049), PaSFB3 (AY805057), PaSFB4 (AY649872), PaSFB4′ (AY649873), PaSFB5 (AY805050), PaSFB6 (AY805051), PaSFB7 (AY805052), PaSFB10 (AY805053), PaSFB12 (AY805054), PaSFB13 (AY805055), PaSFB16 (AY805056).] Though the coding regions of six of these alleles have been reported previously, the intron sequence has previously been reported only for S6. Analysis of the introns revealed sequence and length polymorphisms. A novel, PCR-based method to genotype cultivars and wild accessions was developed which combines fluorescently labelled primers amplifying the intron of SFB with similar primers for the first intron of S-RNase alleles. Intron length polymorphisms were then ascertained using a semi-automated sequencer. The convenience and reliability of this method for the determination of the self-incompatibility (SI) genotype was demonstrated both in sweet cherry cultivars representing alleles S1 to S16 and in individuals from a wild population encompassing S-alleles S17 to S22. This method will greatly expedite SI characterisation in sweet cherry and also facilitate large-scale studies of self-incompatibility in wild cherry and other Prunus populations.

Journal Article
TL;DR: Parsimony and Bayesian phylogenetic analyses showed that the PR5 family is paraphyletic in plants, likely arising from 10 genes in a common ancestor to monocots and eudicots.
Abstract: Pathogenesis-related group 5 (PR5) plant proteins include thaumatin, osmotin, and related proteins, many of which have antimicrobial activity. The recent discovery of PR5-like (PR5-L) sequences in nematodes and insects raises questions about their evolutionary relationships. Using complete plant genome data and discovery of multiple insect PR5-L sequences, phylogenetic comparisons among plants and animals were performed. All PR5/PR5-L protein sequences were mined from genome data of a member of each of two main angiosperm groups—the eudicots (Arabidopsis thaliana) and the monocots (Oryza sativa)—and from the Caenorhabditis nematode (C. elegans and C. briggsae). Insect PR5-L sequences were mined from EST databases and GenBank submissions from four insect orders: Coleoptera (Diaprepes abbreviatus and Biphyllus lunatus), Orthoptera (Schistocerca gregaria), Hymenoptera (Lysiphlebus testaceipes), and Hemiptera (Toxoptera citricida). Parsimony and Bayesian phylogenetic analyses showed that the PR5 family is paraphyletic in plants, likely arising from 10 genes in a common ancestor to monocots and eudicots. After evolutionary divergence of monocots and eudicots, PR5 genes increased asymmetrically among the 10 clades. Insects and nematodes contain multiple sequences (seven PR5-Ls in nematodes and at least three in some insects) all related to the same plant clade, with nematode and insect sequences separating as two clades. Protein structural homology modeling showed strong similarity among animal and plant PR5/PR5-Ls, with divergence only in surface-exposed loops. Sequence and structural conservation among PR5/PR5-Ls suggests an important and conserved role throughout the evolutionary divergence of the diverse organisms in which they reside.

Journal ArticleDOI
TL;DR: The construction and normalization of single and multiple tissue chicken cDNA libraries, high-throughput EST sequencing from these libraries, the CAP3 assembly of a chicken gene index from all public ESTs, and the identification of several nonredundant chicken gene sets for production of custom DNA microarrays are described.
Abstract: Its accessibility, unique evolutionary position, and recently assembled genome sequence have advanced the chicken to the forefront of comparative genomics and developmental biology research as a model organism. Several chicken expressed sequence tag (EST) projects have placed the chicken in 10th place for accrued ESTs among all organisms in GenBank. We have completed the single-pass 5'-end sequencing of 37,557 chicken cDNA clones from several single and multiple tissue cDNA libraries and have entered 35,407 EST sequences into GenBank. Our chicken EST sequences and those found in public databases (on July 1, 2004) provided a total of 517,727 public chicken ESTs and mRNAs. These sequences were used in the CAP3 assembly of a chicken gene index composed of 40,850 contigs and 79,192 unassembled singlets. The CAP3 contigs show a 96.7% match to the chicken genome sequence. The University of Delaware (UD) EST collection (43,928 clones) was assembled into 19,237 nonredundant sequences (13,495 contigs and 5,742 unassembled singlets). The UD collection contains 6,223 unique sequences that are not found in other public EST collections but show a 76% match to the chicken genome sequence. Our chicken contig and singlet sequences were annotated according to the highest BlastX and/or BlastN hits. The UD CAP3 contig assemblies and singlets are searchable by nucleotide sequence or key word (http://cogburn.dbi.udel.edu), and the cDNA clones are readily available for distribution from the chick EST website and clone repository (http://www.chickest.udel.edu). The present paper describes the construction and normalization of single and multiple tissue chicken cDNA libraries, high-throughput EST sequencing from these libraries, the CAP3 assembly of a chicken gene index from all public ESTs, and the identification of several nonredundant chicken gene sets for production of custom DNA microarrays.

Journal ArticleDOI
TL;DR: This EST collection and its annotation provide a significant resource for basic and applied research on T. harzianum, a fungus of high biotechnological interest.
Abstract: The filamentous fungus Trichoderma harzianum is used as a biological control agent against several plant-pathogenic fungi. In order to study the genome of this fungus, a functional genomics project called "TrichoEST" was developed to give insights into genes involved in biological control activities using an approach based on the generation of expressed sequence tags (ESTs). Eight different cDNA libraries from T. harzianum strain CECT 2413 were constructed. Different growth conditions involving mainly different nutrient conditions and/or stresses were used. We here present the analysis of the 8,710 ESTs generated. A total of 3,478 unique sequences were identified, of which 81.4% had sequence similarity with GenBank entries, using the BLASTX algorithm. Using the Gene Ontology hierarchy, we annotated 51.1% of the unique sequences and compared their distribution among the gene libraries. Additionally, the InterProScan algorithm was used in order to further characterize the sequences. The identification of the putatively secreted proteins was also carried out. Later, based on the EST abundance, we examined the highly expressed genes, and a hydrophobin was identified as the gene expressed at the highest level. We compared our collection of ESTs with the previous collections obtained from Trichoderma species and we also compared our sequence set with different complete eukaryotic genomes from several animals, plants and fungi. Accordingly, the presence of similar sequences in different kingdoms was also studied. This EST collection and its annotation provide a significant resource for basic and applied research on T. harzianum, a fungus of high biotechnological interest.

Journal ArticleDOI
19 Oct 2006-Genome
TL;DR: This work constructed 2 bacterial artificial chromosome libraries from an inbred diploid Brachypodium distachyon line, Bd21, using restriction enzymes HindIII and BamHI to develop resources for genome research in this emerging model species and obtained a first view of the sequence composition of the Brachypodium genome.
Abstract: Brachypodium is well suited as a model system for temperate grasses because of its compact genome and a range of biological features. In an effort to develop resources for genome research in this emerging model species, we constructed 2 bacterial artificial chromosome (BAC) libraries from an inbred diploid Brachypodium distachyon line, Bd21, using restriction enzymes HindIII and BamHI. A total of 73,728 clones (36,864 per BAC library) were picked and arrayed in 192 384-well plates. The average insert size for the BamHI and HindIII libraries is estimated to be 100 and 105 kb, respectively, and inserts of chloroplast origin account for 4.4% and 2.4%, respectively. The libraries individually represent 9.4- and 9.9-fold haploid genome equivalents with combined 19.3-fold genome coverage, based on a genome size of 355 Mb reported for the diploid Brachypodium, implying a 99.99% probability that any given specific sequence will be present in each library. Hybridization of the libraries with 8 starch biosynthesis genes was used to empirically evaluate this theoretical genome coverage; the frequency at which these genes were present in the library clones gave an estimated coverage of 11.6- and 19.6-fold genome equivalents. To obtain a first view of the sequence composition of the Brachypodium genome, 2185 BAC end sequences (BES) representing 1.3 Mb of random genomic sequence were compared with the NCBI GenBank database and the GIRI repeat database. Using a cutoff expectation value of E < 10^-10, only 3.3% of the BESs showed similarity to repetitive sequences in the existing database, whereas 40.0% had matches to the sequences in the EST database, suggesting that a considerable portion of the Brachypodium genome is likely transcribed. When the BESs were compared with individual EST databases, more matches hit wheat than maize, although their EST collections are of a similar size, further supporting the close relationship between Brachypodium and the Triticeae.
Moreover, 122 BESs have significant matches to wheat ESTs mapped to individual chromosome bin positions. These BACs represent colinear regions containing the mapped wheat ESTs and would be useful in identifying additional markers for specific wheat chromosome regions.
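The coverage arithmetic in the abstract above can be sketched in a few lines. This is a minimal illustration and not the authors' calculation: it assumes the standard Clarke-Carbon relation P = 1 - (1 - f)^N, where f is the insert size as a fraction of the genome and N is the number of clones, and all function names are hypothetical. A naive estimate from the quoted figures need not reproduce the reported 9.4- and 9.9-fold values exactly, because the abstract does not list every correction that was applied.

```python
import math

GENOME_MB = 355              # genome size quoted in the abstract
CLONES_PER_LIBRARY = 36_864  # clones per BAC library, from the abstract


def genome_equivalents(n_clones, insert_kb, genome_mb, organellar_fraction=0.0):
    """Naive fold-coverage: total nuclear insert length divided by genome length."""
    nuclear_kb = n_clones * insert_kb * (1.0 - organellar_fraction)
    return nuclear_kb / (genome_mb * 1000.0)


def prob_present(n_clones, insert_kb, genome_mb):
    """Clarke-Carbon probability that a given sequence is in the library."""
    f = insert_kb / (genome_mb * 1000.0)
    return 1.0 - (1.0 - f) ** n_clones


def prob_from_coverage(fold):
    """Poisson approximation of the same probability: 1 - exp(-coverage)."""
    return 1.0 - math.exp(-fold)


if __name__ == "__main__":
    # 73,728 clones in total should fill a whole number of 384-well plates.
    print("plates:", 2 * CLONES_PER_LIBRARY // 384)
    libraries = (("BamHI", 100, 0.044), ("HindIII", 105, 0.024))
    for name, insert_kb, organellar in libraries:
        fold = genome_equivalents(CLONES_PER_LIBRARY, insert_kb, GENOME_MB, organellar)
        print(name, "naive fold-coverage:", round(fold, 1))
    # Probability implied by the reported 9.4-fold coverage.
    print("P(present) at 9.4x:", prob_from_coverage(9.4))
```

The Poisson form is convenient when only the fold-coverage is known, as in the abstract; the exact form needs the clone count and insert size.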

Journal ArticleDOI
TL;DR: DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation.
Abstract: The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus enables a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families. DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering.
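To make the three-section relational model described above more concrete, here is a small sketch using Python's built-in `sqlite3` module. It is not the DWARF schema: every table name, column name, and demo row is a hypothetical placeholder chosen only to mirror the protein, sequence, and structure sections named in the abstract.

```python
import sqlite3

# Hypothetical schema: one group of tables per section named in the abstract.
SCHEMA = """
CREATE TABLE superfamily (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE homologous_family (
    id INTEGER PRIMARY KEY,
    superfamily_id INTEGER NOT NULL REFERENCES superfamily(id),
    name TEXT NOT NULL
);
-- Protein section: function, organism, family classification.
CREATE TABLE protein (
    id INTEGER PRIMARY KEY,
    family_id INTEGER NOT NULL REFERENCES homologous_family(id),
    source_organism TEXT,
    biochemical_function TEXT
);
-- Sequence section: residues plus position-specific annotation.
CREATE TABLE sequence (
    id INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    residues TEXT NOT NULL
);
CREATE TABLE position_annotation (
    sequence_id INTEGER NOT NULL REFERENCES sequence(id),
    position INTEGER NOT NULL,
    note TEXT NOT NULL
);
-- Structure section: secondary structure string per solved structure.
CREATE TABLE structure (
    id INTEGER PRIMARY KEY,
    protein_id INTEGER NOT NULL REFERENCES protein(id),
    secondary_structure TEXT
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO superfamily VALUES (1, 'demo superfamily')")
    conn.execute("INSERT INTO homologous_family VALUES (1, 1, 'demo family')")
    conn.executemany(
        "INSERT INTO protein VALUES (?, 1, ?, ?)",
        [(1, "organism A", "placeholder function"),
         (2, "organism B", "placeholder function")],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [(1, 1, "MAQQPF"), (2, 2, "MTQQPF")],
    )
    conn.execute("INSERT INTO position_annotation VALUES (1, 2, 'placeholder note')")
    conn.commit()
    return conn


def sequences_per_superfamily(conn):
    """Count sequences per superfamily by walking family -> protein -> sequence."""
    return conn.execute(
        """
        SELECT sf.name, COUNT(s.id)
        FROM superfamily sf
        JOIN homologous_family hf ON hf.superfamily_id = sf.id
        JOIN protein p ON p.family_id = hf.id
        JOIN sequence s ON s.protein_id = p.id
        GROUP BY sf.id
        ORDER BY sf.name
        """
    ).fetchall()


if __name__ == "__main__":
    print(sequences_per_superfamily(build_demo_db()))
```

Keeping the classification tables separate from the protein table is one plausible way to let a single query summarize sequences at either the family or the superfamily level.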

Journal ArticleDOI
TL;DR: Many journals require authors to deposit DNA sequences in public databases such as GenBank before publication, but how often do authors fail to do so?
Abstract: A recent surge of interest in data sharing and data access has swept through the scientific community (e.g., [1]). Scientists recognize that free access to data is synergistic for fostering major advances. Concerns about standards of sharing are particularly acute with respect to large-scale DNA sequence and microarray data. Although some types of data have shallow histories or unclear protocols for how one would share them, DNA sequences have been deposited to the joint databases of GenBank, EMBL, and the DNA Databank of Japan [2, 3, 4] for over a decade, and many journals have policies requiring such submission before a paper can be accepted. For simplicity, we refer to these databases jointly as “GenBank.” However, policies are only as good as their adherence and enforcement, and one should consider the effectiveness of past policies before launching new ones. Given its long history, an ideal case study is the deposit of sequences reported in published work into GenBank. We know from personal experience that authors of published papers reporting DNA sequences sometimes intentionally fail to deposit their sequences to GenBank and refuse to release them upon request. Is this a rare exception, or do many papers make it past coauthors, associate editors, editors, reviewers, and journal staff without providing the purportedly required data accession numbers? We examined the frequency with which published studies failed to submit their DNA sequences to GenBank. We searched six journals that have explicit policies requiring the submission of DNA sequences (Table 1). The previous six months (through February, 2006) of issues or most recent 30 papers presenting DNA sequence data were sought, though we examined a year of issues of one journal to test a longer-term trend (see below). Studies presenting only confirmatory sequence or sequences that would not be found in nature (e.g., transgenes) were not counted.
We searched for GenBank entries using locus and species names, authors of the article, etc., thus executing a search that would mimic any scientist's approach. [Table 1: Submission of DNA Sequences from Published Studies to GenBank.] No journal had complete compliance with its requirement for all DNA sequences to have been submitted to GenBank (Table 1). Between 3% and 20% of papers in these journals did not include GenBank accession numbers, and between 3% and 15% of studies never submitted their DNA sequences at all. We also identified several papers with errors in the supplied accession numbers, but these errors were not counted. Table 1 also notes “special cases” that we considered less egregious violations of the journal rules. In these cases, the DNA sequences were noted in the paper itself, either as supplementary materials or by noting identity to published sequences. The observation that some papers have sequences submitted to GenBank despite not providing accession numbers suggests that common reasons may be forgetfulness of the authors coupled with the lack of consistent policing at the journals. An author forgets to deposit the sequences, but then remembers (or is reminded) after the paper is published and submits them at that time. If this “oops effect” is common, we expect that the publication dates of the papers completely lacking GenBank submissions should be later on average than the publication dates of papers only missing accession numbers. We test this prediction with our observations from the journal Molecular Biology and Evolution, for which our survey spans a longer time period. Consistent with the prediction, we find a marginally significant difference in publication month (t = 2.2, p = 0.08) in the expected direction between papers for which sequences were never submitted to GenBank and those for which sequences were submitted but accession numbers were not printed.
Our study concludes that the majority of papers provide their DNA sequences to GenBank. However, from the perspective of the journal, it is understandable how some papers get into print without providing sequence accession numbers. Paper submissions are closely checked for content by reviewers and associate editors, and they do not see it as their duty to police these policies—they merely evaluate the science and presentation. In contrast, the staff at many journals are often not scientists and not trained to recognize when and how these should be presented. Hence, although the staff may be asked to police this policy, it is understandable that they may miss several cases. Although the failure to submit DNA sequences to GenBank appears rare, we must consider the consequences to an author if he/she intentionally publishes a paper without providing access to the data. There are two possibilities of how enforcement by the journal could be achieved. The author could be “flagged” such that future submissions to the journal would be declined until the DNA sequences have been released. This is a rather aggressive stance, and journals are unlikely to adopt it. We suggest an easy alternative. In the 21st century, many writers access publications online. We propose that, in cases where an author has not released DNA sequences, the author be given one month notice, at which point, if accession numbers are not provided, the publication is removed from the journal Web site until compliance is reached. One cannot take back the printed issue, but having the publication removed from the journal website would prevent anyone from accessing it online. This approach would focus the consequences to the deviant publication. The databases of GenBank, EMBL, and the DNA Databank of Japan [2, 3, 4] serve as a model for data sharing from which the entire scientific community can learn.
Although they sometimes get bad publicity for errors in DNA sequence submissions (e.g., see [5]), the positive impact they have had on all areas of biology is enormous. Let us look to the future and hope that new proposed forms of data sharing (e.g., [1]) are even more successful.

Journal ArticleDOI
TL;DR: The present characterization of the Bet v 1 genes provides a framework for the screening of specific Bet v 1 genes among other B. pendula cultivars or Betula species, and for future breeding for trees with a reduced allergenicity.
Abstract: Background: Pollen of the European white birch (Betula pendula, syn. B. verrucosa) is an important cause of hay fever. The main allergen is Bet v 1, a member of the pathogenesis-related class 10 (PR-10) multigene family. To establish the number of PR-10/Bet v 1 genes and the isoform diversity within a single tree, PCR amplification, cloning and sequencing of PR-10 genes was performed on two diploid B. pendula cultivars and one interspecific tetraploid Betula hybrid. Sequences were attributed to putative genes based on sequence identity and intron length. Information on transcription was derived by comparison with homologous cDNA sequences available in GenBank/EMBL/DDBJ. PCR-cloning of multigene families carries a high risk of PCR recombination artifacts. We screened for and excluded these artifacts, and also detected putative artifact sequences among database sequences.

Journal ArticleDOI
TL;DR: The expressed sequence tags (ESTs) referenced in this report represent the first leaf transcriptome data from a half-shade ginseng plant, and the majority of the identified transcripts were found to be genes related to energy, metabolism, subcellular localization, and protein synthesis and transport.
Abstract: The expressed sequence tags (ESTs) referenced in this report represent the first leaf transcriptome data from a half-shade ginseng plant. A cDNA library was constructed from samples of the leaves of 4-year-old Panax ginseng plants, which were grown in a field. The 2,896 P. ginseng cDNA clones represent 1,576 unique sequences, consisting of 1,167 singletons and 409 contig sequences. BLAST comparisons of the cDNAs against GenBank's non-redundant databases revealed that 2,579 of the 2,896 cDNAs (89.1%) exhibited a high degree of sequence homology to genes from other organisms. The majority of the identified transcripts were found to be genes related to energy, metabolism, subcellular localization, and protein synthesis and transport. The chlorophyll a/b-binding protein ESTs in the ginseng leaf samples showed a substantially higher level of expression than was observed in other plant leaves. The ESTs involved in ginsenoside biosynthesis were also identified and discussed.

Journal ArticleDOI
TL;DR: Analyses of the relationship between Glu-D3 alleles defined by protein electrophoretic mobility and different GluD3 gene variations at the DNA or protein level provided a molecular basis for DNA-based identification of glutenin alleles.
Abstract: Low-molecular-weight glutenins (LMW-GS) in common wheat (Triticum aestivum L.) are of great importance for the processing quality of pan bread and noodles. The objectives of this study are to identify LMW-GS coding genes at the GluD3 locus on chromosome 1D and to establish relationships between these genes and GluD3 alleles (a, b, c, d, and e) defined by protein electrophoretic mobility. Specific primer sets were designed to amplify each of the three LMW-GS chromosome 1D gene regions, including upstream, coding and downstream regions, of eight wheat cultivars containing GluD3 a, b, c, d and e alleles. Three LMW-GS genes, designated as GluD3-1, GluD3-2 and GluD3-3, were amplified from the eight wheat cultivars. The allelic variants of these three genes were analysed at the DNA and protein level. GluD3-1 showed two allelic variants or haplotypes, one common to cultivars containing protein alleles a, d and e (designated GluD3-11) and the other present in cultivars with alleles b and c (designated GluD3-12). Compared with GluD3-12, a 3-bp deletion was found in the coding region of the N-terminal repetitive domain of GluD3-11, leading to a glutamine deletion at the 116th position. GluD3-2 had three variants at the DNA level in the eight cultivars, which were designated as GluD3-21, GluD3-22 and GluD3-23. In comparison to GluD3-21, a single nucleotide polymorphism (SNP) was detected for GluD3-22 in the signal peptide region, resulting in an amino acid change from alanine to threonine at the 11th position; and 11 mutations were found in GluD3-23, with five in the upstream region, four in the coding region and two in the downstream region. GluD3-3 had two haplotypes, designated as GluD3-31 and GluD3-32, both belonging to LMW-s glutenin subunits though their first amino acids in the N-terminal region are different. Compared with the GenBank GluD3 genes, nucleotide sequences of GluD3-21 and GluD3-23 were the same as X13306 and AB062875, respectively.
GluD3-22 and GluD3-11 had only a one-base difference from U86027 and AB062865. GluD3-12 was not found in the GenBank database, indicating a newly identified GluD3 gene variation. GluD3-3 was a new gene different from any other known GluD3 genes. Analyses of the relationship between Glu-D3 alleles defined by protein electrophoretic mobility and different GluD3 gene variations at the DNA or protein level provided a molecular basis for DNA-based identification of glutenin alleles.
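The two kinds of variation described above, a single-base substitution that changes alanine to threonine and an in-frame 3-bp deletion that removes one glutamine, can be illustrated on a toy sequence. The sketch below is only an illustration: the 18-base sequence is invented and is not a GluD3 sequence, the codon table is a small subset of the standard genetic code, and the helper names are hypothetical.

```python
# Subset of the standard genetic code, enough for the toy sequence below.
CODON_TABLE = {
    "ATG": "M", "GCA": "A", "ACA": "T",
    "CAA": "Q", "CAG": "Q", "CCA": "P", "TTC": "F",
}


def translate(dna):
    """Translate an in-frame DNA string codon by codon."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def substitute(dna, position, base):
    """Return dna with the base at a 0-based position replaced (a SNP)."""
    return dna[:position] + base + dna[position + 1:]


def delete_codon(dna, codon_index):
    """Return dna with one whole codon removed (an in-frame 3-bp deletion)."""
    start = 3 * codon_index
    return dna[:start] + dna[start + 3:]


# Invented toy reference: ATG GCA CAA CAG CCA TTC
REFERENCE = "ATGGCACAACAGCCATTC"

# G -> A at the first position of the GCA codon turns Ala (GCA) into Thr (ACA).
snp_variant = substitute(REFERENCE, 3, "A")

# Removing the CAA codon drops one glutamine and keeps the reading frame.
deletion_variant = delete_codon(REFERENCE, 2)

if __name__ == "__main__":
    for label, seq in (("reference", REFERENCE),
                       ("SNP", snp_variant),
                       ("3-bp deletion", deletion_variant)):
        print(label, seq, translate(seq))
```

Because the deletion spans exactly one codon, every downstream residue is unchanged, which is why such variants alter protein length without shifting the frame.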

Journal ArticleDOI
Sen-Yang Cao, Xiaobing Wu, Peng Yan, Yu-Ling Hu, Xia Su, Zhigang Jiang
TL;DR: A phylogenetic tree is constructed with maximum likelihood (ML) and maximum parsimony (MP) methods, and the phylogenetic relationships among 11 species of Anura are discussed.

Journal ArticleDOI
TL;DR: Mammalian Promoter Database (MPromDb), a novel database that integrates gene promoters with experimentally supported annotation of transcription start sites, cis-regulatory elements, CpG islands and chromatin immunoprecipitation microarray experimental results with intuitively designed presentation, is developed.
Abstract: We have developed Mammalian Promoter Database (MPromDb), a novel database that integrates gene promoters with experimentally supported annotation of transcription start sites, cis-regulatory elements, CpG islands and chromatin immunoprecipitation microarray (ChIP-chip) experimental results with intuitively designed presentation. Release 1.0 of MPromDb currently contains 36,407 promoters and first exons (19,170 from human, 15,953 from mouse and 1284 from rat), 3739 transcription factor (TF)-binding sites (2027 from human, 1181 from mouse and 531 from rat) and 224 TFs with links to PubMed and GenBank references. Target promoters of TFs that have been identified by ChIP-chip assay are integrated into the database. MPromDb serves as a portal for genome-wide promoter analysis of data generated by ChIP-chip experimental studies. MPromDb can be accessed from http://bioinformatics.med.ohio-state.edu/MPromDb.
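A quick consistency check on the release statistics quoted above: the per-species counts should add up to the stated totals. The snippet below is a small sketch using only the numbers from the abstract; the dictionary and function names are hypothetical.

```python
# Per-species counts and totals as quoted in the abstract (Release 1.0).
PROMOTERS = {"human": 19_170, "mouse": 15_953, "rat": 1_284}
TF_BINDING_SITES = {"human": 2_027, "mouse": 1_181, "rat": 531}
REPORTED_TOTALS = {"promoters": 36_407, "tf_binding_sites": 3_739}


def totals_match(per_species, reported_total):
    """Return True when the per-species counts sum to the reported total."""
    return sum(per_species.values()) == reported_total


if __name__ == "__main__":
    print("promoters consistent:",
          totals_match(PROMOTERS, REPORTED_TOTALS["promoters"]))
    print("TF-binding sites consistent:",
          totals_match(TF_BINDING_SITES, REPORTED_TOTALS["tf_binding_sites"]))
```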