
Showing papers on "GenBank published in 2012"


Journal ArticleDOI
TL;DR: The NCBI Taxonomy database is a central organizing hub for many of the resources at the NCBI, and provides a means for clustering elements within other domains of the NCBI web site, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web.
Abstract: The NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy) is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), comprising the GenBank, ENA (EMBL) and DDBJ databases. It includes organism names and taxonomic lineages for each of the sequences represented in the INSDC's nucleotide and protein sequence databases. The taxonomy database is manually curated by a small group of scientists at the NCBI who use the current taxonomic literature to maintain a phylogenetic taxonomy for the source organisms represented in the sequence databases. The taxonomy database is a central organizing hub for many of the resources at the NCBI, and provides a means for clustering elements within other domains of the NCBI web site, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web. Our primary purpose is to index the domain of sequences as conveniently as possible for our user community.

1,142 citations


Journal ArticleDOI
TL;DR: CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results, and will become an indispensable tool for researchers studying chloroplast genomes.
Abstract: The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, so powerful computational tools for annotating the genome sequences are urgently needed. We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for update and re-analysis repeatedly. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while having several superior functionalities. CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas .

501 citations
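
The CPGAVAS abstract above notes that annotation results are exchanged in GFF3 format. As a rough illustration of what such a record looks like to a program, here is a minimal sketch of parsing one GFF3 feature line (nine tab-separated columns). The example line, including the sequence ID `cp_genome`, the coordinates and the gene name, is invented for illustration and is not taken from CPGAVAS output.

```python
from typing import Dict, NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse a single GFF3 feature line into a Gff3Feature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
    )
    return Gff3Feature(seqid, source, ftype, int(start), int(end), strand, attributes)


# Hypothetical record, for illustration only.
example = "cp_genome\tCPGAVAS\tgene\t100\t1161\t.\t+\t.\tID=gene1;Name=psbA"
feature = parse_gff3_line(example)
feature_length = feature.end - feature.start + 1
```

Because GFF3 coordinates are 1-based and inclusive, the feature length is `end - start + 1`.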


Journal ArticleDOI
TL;DR: The generation of a 63-fold coverage draft genome sequence of N. benthamiana is reported and it will be useful for comparative genomics in the Solanaceae family as shown here by the discovery of microsynteny between N. benthamiana and tomato in the region encompassing the Pto and Prf genes.
Abstract: Nicotiana benthamiana is a widely used model plant species for the study of fundamental questions in molecular plant-microbe interactions and other areas of plant biology. This popularity derives from its well-characterized susceptibility to diverse pathogens and, especially, its amenability to virus-induced gene silencing and transient protein expression methods. Here, we report the generation of a 63-fold coverage draft genome sequence of N. benthamiana and its availability on the Sol Genomics Network for both BLAST searches and for downloading to local servers. The estimated genome size of N. benthamiana is 3 Gb (gigabases). The current assembly consists of approximately 141,000 scaffolds, spanning 2.6 Gb with 50% of the genome sequence contained within scaffolds >89 kilobases. Of the approximately 16,000 N. benthamiana unigenes available in GenBank, >90% are represented in the assembly. The usefulness of the sequence was demonstrated by the retrieval of N. benthamiana orthologs for 24 immunity-associated genes from other species including Ago2, Ago7, Bak1, Bik1, Crt1, Fls2, Pto, Prf, Rar1, and mitogen-activated protein kinases. The sequence will also be useful for comparative genomics in the Solanaceae family as shown here by the discovery of microsynteny between N. benthamiana and tomato in the region encompassing the Pto and Prf genes.

417 citations
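
The N. benthamiana abstract above summarizes assembly contiguity as "50% of the genome sequence contained within scaffolds >89 kilobases". A statistic of this kind is commonly called N50, although the abstract does not use that name. The sketch below computes it relative to the total assembled length. Whether the authors used assembled length or estimated genome size as the denominator is not stated, so that choice is an assumption here, and the scaffold lengths are toy values.

```python
def n50(lengths):
    """Return the length L such that scaffolds of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy scaffold lengths (not real data): total = 300, half = 150.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)  # 100 + 80 = 180 >= 150, so the result is 80
```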


Journal ArticleDOI
TL;DR: A mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource is introduced and tools for translating similarity searches into many annotation namespaces are presented.
Abstract: Computing of sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to limit the needs for computational reanalysis of these data sets. A prerequisite for sharing of similarity results is a common reference. We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank. The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.

281 citations
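
The abstract above describes maintaining a non-redundant protein database that maps identical sequences from several sources to one shared entry. One simple way to sketch that idea is to key each sequence by a checksum. The use of MD5, the normalization step and the record identifiers below are all assumptions for illustration, since the abstract does not specify the mechanism.

```python
import hashlib


def build_nonredundant(records):
    """Collapse (identifier, protein_sequence) pairs into one entry per
    distinct sequence, keeping every source identifier that maps to it."""
    by_checksum = {}
    for identifier, sequence in records:
        normalized = sequence.strip().upper()
        checksum = hashlib.md5(normalized.encode("ascii")).hexdigest()
        entry = by_checksum.setdefault(
            checksum, {"sequence": normalized, "ids": []}
        )
        entry["ids"].append(identifier)
    return by_checksum


# Hypothetical records: two sources share the same protein sequence.
records = [
    ("genbank|A1", "MKTAYIAK"),
    ("kegg|K1", "mktayiak"),
    ("genbank|B2", "MSTNPKPQ"),
]
nr = build_nonredundant(records)
```

With a table like this, one similarity search against the distinct sequences can be translated back into each source namespace through the stored identifiers.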


Journal ArticleDOI
TL;DR: A bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes the classification of viral metagenome sequences (predicted open-reading frames) based on homology search results against both known and environmental sequences.
Abstract: One consistent finding among studies using shotgun metagenomics to analyze whole viral communities is that most viral sequences show no significant homology to known sequences. Thus, bioinformatic analyses based on sequence collections such as GenBank nr, which are largely comprised of sequences from known organisms, tend to ignore a majority of sequences within most shotgun viral metagenome libraries. Here we describe a bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes the classification of viral metagenome sequences (predicted open-reading frames) based on homology search results against both known and environmental sequences. Functional and taxonomic information is derived from five annotated sequence databases which are linked to the UniRef 100 database. Environmental classifications are obtained from hits against a custom database, MetaGenomes On-Line, which contains 49 million predicted environmental peptides. Each predicted viral metagenomic ORF run through the VIROME pipeline is placed into one of seven ORF classes, so every sequence receives a meaningful annotation. Additionally, the pipeline includes quality control measures to remove contaminating and poor-quality sequence and assesses the potential amount of cellular DNA contamination in a viral metagenome library by screening for rRNA genes. Access to the VIROME pipeline and analysis results is provided through a web-application interface that is dynamically linked to a relational back-end database. The VIROME web-application interface is designed to allow users flexibility in retrieving sequences (reads, ORFs, predicted peptides) and search results for focused secondary analyses.

170 citations


Journal ArticleDOI
TL;DR: To characterize the complete genome of this virus, 2 PCR-positive samples (FRHEV4 and FRHEV20) were selected, and different sets of specific primers on the basis of sequence fragments obtained by 454 pyrosequencing were developed and confirmed by nucleotide sequencing.
Abstract: To the Editor: Hepatitis E virus (HEV), a member of the family Hepeviridae and the genus Hepevirus, is transmitted by the fecal–oral route and causes liver inflammation, which leads to mortality rates of ≤20% in pregnant women (1,2). Human hepatitis E is a major disease not only in developing countries but also in industrialized countries, and identification of animal strains of HEV in pigs and deer and its zoonotic potential has raised considerable public health concerns (1,3). Recent reports suggest that other animals such as rats, mongooses, chickens, rabbits, and trout also may harbor HEVs (1–5). The genomes of these viruses are ≈6.6 kb–7.2 kb and encode 3 open reading frames (ORFs) flanked by a capped 5′ end and a poly A tail at the 3′ end (1,3). We used random PCR amplification and high-throughput sequencing technology to investigate HEV sequences in ferrets (Mustela putorius) from the Netherlands. In 2010, fecal samples were collected from ferrets in the Netherlands and stored at −80°C. Samples that were negative for ferret coronavirus (6) were further characterized for other pathogens. Viral RNA was isolated and viral metagenomic libraries were constructed for 454 pyrosequencing as described (7,8), and 248,840 sequence reads were generated from 7 fecal samples. Using Blastn and Blastx (www.ncbi.nlm.nih.gov/BLAST), we identified 289 sequence reads in 1 sample that were related to rat HEV and that could be assembled into 6 contigs covering ≈50% of the ferret HEV (FRHEV) genome. We then developed a set of nested PCR primers on the basis of obtained sequences to detect viral RNA (Technical Appendix Table 1). Total RNA extracted from 43 ferret fecal samples collected from 19 locations in the Netherlands was used to perform reverse transcription PCR amplification. Using this PCR, we detected viral RNA in 4 (9.3%) fecal samples tested from 4 locations (distance between each sampling location ranged from 25 km to 127 km).
All amplicons were confirmed by nucleotide sequencing. We have limited information regarding the clinical disease this virus may cause because these samples were obtained from household pet ferrets that did not show overt clinical signs. In addition, 4/16 animals from a single farm were IgG positive when tested for IgG against HEV by using recombinant human HEV protein (Wantai, Beijing, China). To characterize the complete genome of this virus, we selected 2 PCR-positive samples (FRHEV4 and FRHEV20), developed different sets of specific primers on the basis of sequence fragments obtained by 454 pyrosequencing, directly sequenced amplicons by Sanger sequencing, and used a rapid amplification of cDNA ends PCR to obtain 5′ and 3′ frame end sequences. Using overlapping fragments we assembled 2 complete FRHEV genome sequences that contained 6,854 nt, including a 13-nt 3′ poly A tail and a 12-nt 5′ end. FRHEV full-genome sequences FRHEV4 and FRHEV20 showed 98.6% sequence identity and were deposited into GenBank under accession nos. JN998606 and JN998607, respectively. The FRHEV genome contains a complete ORF1 gene that encodes a nonstructural protein of 1,596 aa, an ORF2 gene that encodes a capsid protein of 654 aa, an ORF3 gene that encodes a phosphoprotein of 108 aa, and a 3′ noncoding region of 78 nt. Sequence analyses indicated that the FRHEV genome shared the highest identity (72.3%) with rat HEV. Sequence identity with HEV genotypes 1–4 and rabbit and avian HEVs ranged from 54.5% to 60.5% (Technical Appendix Table 2). The FRHEV genome organization was found to be slightly different from other HEVs and included a putative ORF (ORF4) of 552 nt that overlapped with ORF1 (Technical Appendix Figure). A similar pattern of genome organization was observed for both FRHEVs.
Phylogenetic analysis of the complete genomes clearly showed that FRHEV was separated from genotype 1–4 HEVs and clustered with rat HEV (Figure). Similar phylogenetic clustering was observed when nucleotide and deduced amino acid sequences of ORF1, ORF2, and ORF3 were analyzed separately. The phylogenetic distance between rat HEV and FRHEV is larger than the distance between genotype 1 and genotype 2 HEV. In recent years, an increasing number of sporadic cases of hepatitis E have been reported (1,9). Several observations suggest that autochthonous cases are caused by zoonotic spread of infection from wild or domestic animals (1,3,9). In addition, IgG anti-HEV seropositivity in the United States has been associated with several factors, including having a pet at home (10). Further studies are needed to identify the zoonotic potential of FRHEV.
Figure: Phylogenetic tree based on the complete genomic sequences of ferret hepatitis E viruses (HEVs) and human, rabbit, swine, avian, and rat HEV strains. Names of HEV strains follow GenBank accession numbers. Sequence alignment was performed by using ClustalW ...
Technical Appendix: Genome organization of hepatitis E viruses (HEVs) and initiation of translation of open reading frame 1 (ORF1), ORF2, and ORF3 of ferret HEV.

166 citations


Journal ArticleDOI
14 May 2012-PLOS ONE
TL;DR: The genome structure, gene order, gene and intron contents, AT contents, codon usage, and transcription units are similar to the typical angiosperm cp genomes.
Abstract: Sesamum indicum is an important oil-yielding crop species. The complete chloroplast (cp) genome of S. indicum (GenBank acc no. JN637766) is 153,324 bp in length, and has a pair of inverted repeat (IR) regions consisting of 25,141 bp each. The lengths of the large single copy (LSC) and the small single copy (SSC) regions are 85,170 bp and 17,872 bp, respectively. Comparative cp DNA sequence analyses of S. indicum with other cp genomes reveal that the genome structure, gene order, gene and intron contents, AT contents, codon usage, and transcription units are similar to the typical angiosperm cp genomes. Nucleotide diversity of the IR region between Sesamum and three other cp genomes is much lower than that of the LSC and SSC regions in both the coding region and noncoding region. In summary, regional constraints strongly affect the sequence evolution of the cp genomes, while functional constraints affect it only weakly. Five short inversions associated with short palindromic sequences that form stem-loop structures were observed in the chloroplast genome of S. indicum. Twenty-eight different simple sequence repeat (SSR) loci have been detected in the chloroplast genome of S. indicum. Almost all of the SSR loci were composed of A or T, so this may also contribute to the A-T richness of the cp genome of S. indicum. Seven large repeated loci in the chloroplast genome of S. indicum were also identified and these loci are useful for developing S. indicum-specific cp genome vectors. The complete cp DNA sequences of S. indicum reported in this paper are a prerequisite for modifying this important oilseed crop by cp genetic engineering techniques.

134 citations
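
The region lengths quoted in the S. indicum abstract above can be checked against the reported genome size, since a typical chloroplast genome consists of one LSC, one SSC and two IR copies. The short sketch below does that arithmetic using only the figures from the abstract. The variable names are illustrative.

```python
# Lengths in bp, as reported in the abstract.
LSC = 85_170
SSC = 17_872
IR = 25_141  # each of the two inverted repeat copies
REPORTED_TOTAL = 153_324

# One LSC, one SSC and two IR copies make up the circular genome.
computed_total = LSC + SSC + 2 * IR

# Share of the genome occupied by the two IR copies together.
ir_fraction = 2 * IR / computed_total

print(computed_total == REPORTED_TOTAL)
print(f"IR share of genome: {ir_fraction:.1%}")
```

The sum of the parts matches the reported 153,324 bp, and the two IR copies together account for roughly a third of the genome.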


Journal ArticleDOI
07 Nov 2012-PLOS ONE
TL;DR: This study demonstrates how this molecular identification method combined with high-throughput sequencing can open new realms of possibilities in achieving fast, accurate and inexpensive species identification.
Abstract: Rodentia is the most diverse order among mammals, with more than 2,000 species currently described. In most cases, assigning species from morphological data alone is so difficult that identifying rodents to the species level is a real challenge. In this study, we compared the applicability of 100 bp mini-barcodes from cytochrome b and cytochrome c oxidase 1 genes to enable rodent species identification. Based on GenBank sequence datasets of 115 rodent species, a 136 bp fragment of cytochrome b was selected as the most discriminatory mini-barcode, and rodent universal primers surrounding this fragment were designed. The efficacy of this new molecular tool was assessed on 946 samples including rodent tissues, feces, museum samples and feces/pellets from predators known to ingest rodents. Utilizing next-generation sequencing technologies able to sequence mixes of DNA, 1,140 amplicons were tagged, multiplexed and sequenced together in one single 454 GS-FLX run. Our method was initially validated on a reference sample set including 265 clearly identified rodent tissues, corresponding to 103 different species. Following validation, 85.6% of 555 rodent samples from Europe, Asia and Africa whose species identity was unknown could be identified using the BLASTN program and GenBank reference sequences. In addition, our method proved effective even on degraded rodent DNA samples: 91.8% and 75.9% of samples from feces and museum specimens respectively were correctly identified. Finally, we succeeded in determining the diet of 66.7% of the investigated carnivores from their feces and 81.8% of owls from their pellets. Non-rodent species were also identified, suggesting that our method is sensitive enough to investigate complete predator diets. This study demonstrates how this molecular identification method combined with high-throughput sequencing can open new realms of possibilities in achieving fast, accurate and inexpensive species identification.

103 citations


Journal ArticleDOI
TL;DR: The Database for Bacterial Group II Introns (http://webapps2.ucalgary.ca/~groupii/index.html#) provides a catalogue of full-length, non-redundant group II introns present in bacterial DNA sequences in GenBank.
Abstract: The Database for Bacterial Group II Introns (http://webapps2.ucalgary.ca/~groupii/index.html#) provides a catalogue of full-length, non-redundant group II introns present in bacterial DNA sequences in GenBank. The website is divided into three sections. The first section provides general information on group II intron properties, structures and classification. The second and main section lists information for individual introns, including insertion sites, DNA sequences, intron-encoded protein sequences and RNA secondary structure models. The final section provides tools for identification and analysis of intron sequences. These include a step-by-step guide to identify introns in genomic sequences, a local BLAST tool to identify closest intron relatives to a query sequence, and a boundary-finding tool that predicts 5' and 3' intron-exon junctions in an input DNA sequence. Finally, selected intron data can be downloaded in FASTA format. It is hoped that this database will be a useful resource not only to group II intron and RNA researchers, but also to microbiologists who encounter these unexpected introns in genomic sequences.

95 citations


Journal ArticleDOI
TL;DR: Full-length E2 encoding sequences proved to be most suitable for reliable and statistically significant phylogeny, and the analyses gave results as good as those obtained with the much longer entire 5′NTR-E2 sequences, recommending this strategy as a solid and improved basis for CSFV molecular epidemiology.
Abstract: Molecular epidemiology has proven to be an essential tool in the control of classical swine fever (CSF) and its use has significantly increased during the past two decades. Phylogenetic analysis is a prerequisite for virus tracing and thus allows implementing more effective control measures. So far, fragments of the 5′NTR (150 nucleotides, nt) and the E2 gene (190 nt) have frequently been used for phylogenetic analyses. The short sequence lengths represent a limiting factor for differentiation of closely related isolates and also for confidence levels of proposed CSFV groups and subgroups. In this study, we used a set of 33 CSFV isolates in order to determine the nucleotide sequences of a 3508–3510 nt region within the 5′ terminal third of the viral genome. Including 22 additional sequences from the GenBank database, different regions of the genome, comprising the formerly used short 5′NTR and E2 fragments as well as the genomic regions encoding the individual viral proteins Npro, C, Erns, E1, and E2, were compared with respect to variability and suitability for phylogenetic analysis. Full-length E2 encoding sequences (1119 nt) proved to be most suitable for reliable and statistically significant phylogeny, and the analyses gave results as good as those obtained with the much longer entire 5′NTR-E2 sequences. This strategy is therefore recommended by the EU and OIE Reference Laboratory for CSF as it provides a solid and improved basis for CSFV molecular epidemiology. Finally, the power of this method is illustrated by the phylogenetic analysis of closely related CSFV isolates from a recent outbreak in Lithuania.

84 citations


Journal ArticleDOI
TL;DR: In this article, a high-throughput system for the identification of novel crystal protein genes (cry) from Bacillus thuringiensis strains was designed, which employed three different kinds of well-developed prediction methods, BLAST, hidden Markov model (HMM), and support vector machine (SVM), to predict the presence of Cry toxin genes.
Abstract: We have designed a high-throughput system for the identification of novel crystal protein genes (cry) from Bacillus thuringiensis strains. The system was developed with two goals: (i) to acquire the mixed plasmid-enriched genomic sequence of B. thuringiensis using next-generation sequencing biotechnology, and (ii) to identify cry genes with a computational pipeline (using BtToxin_scanner). In our pipeline method, we employed three different kinds of well-developed prediction methods, BLAST, hidden Markov model (HMM), and support vector machine (SVM), to predict the presence of Cry toxin genes. The pipeline proved to be fast (average speed, 1.02 Mb/min for proteins and open reading frames [ORFs] and 1.80 Mb/min for nucleotide sequences), sensitive (it detected 40% more protein toxin genes than a keyword extraction method using genomic sequences downloaded from GenBank), and highly specific. Twenty-one strains from our laboratory's collection were selected based on their plasmid pattern and/or crystal morphology. The plasmid-enriched genomic DNA was extracted from these strains and mixed for Illumina sequencing. The sequencing data were de novo assembled, and a total of 113 candidate cry sequences were identified using the computational pipeline. Twenty-seven candidate sequences were selected on the basis of their low level of sequence identity to known cry genes, and eight full-length genes were obtained with PCR. Finally, three new cry-type genes (primary ranks) and five cry holotypes, which were designated cry8Ac1, cry7Ha1, cry21Ca1, cry32Fa1, and cry21Da1 by the B. thuringiensis Toxin Nomenclature Committee, were identified. The system described here is both efficient and cost-effective and can greatly accelerate the discovery of novel cry genes.

Journal ArticleDOI
TL;DR: This study quantifies ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and proposes a solution to ameliorate this problem and suggests that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.
Abstract: Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem. A survey of draft-quality genomes in GenBank revealed that fragmented ORFs comprised > 80% of the predicted ORFs in some genomes, and that increased fragmentation correlated with decreased genome assembly quality. In a more thorough analysis of 25 Streptomyces genomes, fragmentation was especially enriched in some protein classes with repeating, multi-modular structures such as polyketide synthases, non-ribosomal peptide synthetases and serine/threonine kinases. Overall, increased genome fragmentation correlated with increased false-negative Pfam and COG annotation rates and increased false-positive KEGG annotation rates. The false-positive KEGG annotation rate could be ameliorated by linking fragmented ORFs using their orthologs in related genomes. Whereas this strategy successfully linked up to 46% of the total ORF fragments in some genomes, its sensitivity appeared to depend heavily on the depth of sampling of a particular taxon's variable genome. Draft microbial genomes contain many ORF fragments. Where these correspond to the same gene they have particular potential to confound comparative gene content analyses. Given our findings, and the rapid increase in the number of microbial draft quality genomes, we suggest that accounting for gene fragmentation and its associated biases is important when designing comparative genomic projects.

Journal ArticleDOI
TL;DR: This study has provided the most complete dinoflagellate gene catalog known to date and exploited RNA-Seq to address fundamental issues in basic transcription mechanisms and sequence conservation in these algae.
Abstract: Dinoflagellates are an important component of the marine biota, but a large genome with high-copy number (up to 5,000) tandem gene arrays has made genomic sequencing problematic. More importantly, little is known about the expression and conservation of these unusual gene arrays. We assembled de novo a gene catalog of 74,655 contigs for the dinoflagellate Lingulodinium polyedrum from RNA-Seq (Illumina) reads. The catalog contains 93% of a Lingulodinium EST dataset deposited in GenBank and 94% of the enzymes in 16 primary metabolic KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways, indicating it is a good representation of the transcriptome. Analysis of the catalog shows a marked underrepresentation of DNA-binding proteins and DNA-binding domains compared with other algae. Despite this, we found no evidence to support the proposal of polycistronic transcription, including a marked underrepresentation of sequences corresponding to the intergenic spacers of two tandem array genes. We also have used RNA-Seq to assess the degree of sequence conservation in tandem array genes and found their transcripts to be highly conserved. Interestingly, some of the sequences in the catalog have only bacterial homologs and are potential candidates for horizontal gene transfer. These presumably were transferred as single-copy genes, and because they are now all GC-rich, any derived from AT-rich contexts must have experienced extensive mutation. Our study not only has provided the most complete dinoflagellate gene catalog known to date, it has also exploited RNA-Seq to address fundamental issues in basic transcription mechanisms and sequence conservation in these algae.

Journal ArticleDOI
TL;DR: Investigating how many species of multicellular animals have been barcoded finds species coverage is considerably better for target taxa of DNA barcoding campaigns (e.g. birds, fishes, Lepidoptera), although it also falls short of published campaign targets.

Journal ArticleDOI
01 Jan 2012-Gene
TL;DR: The nucleotide variation in both the 16S and 12S sequences was suitable for identifying the large majority of the examined fish specimens to at least the level of genus, but was found to be less useful for the explicit differentiation of certain congeneric fish species.

Journal ArticleDOI
TL;DR: The Heat Shock Protein Information Resource (HSPIR) is a database of six major heat shock proteins, namely Hsp70, Hsp40, Hsp60, Hsp90, Hsp100 and small HSP.
Abstract: Heat shock protein information resource (HSPIR) is a concerted database of six major heat shock proteins (HSPs), namely, Hsp70, Hsp40, Hsp60, Hsp90, Hsp100 and small HSP. The HSPs are essential for the survival of all living organisms, as they protect the conformations of proteins on exposure to various stress conditions. They are a highly conserved group of proteins involved in diverse physiological functions, including de novo folding, disaggregation and protein trafficking. Moreover, their critical role in the control of disease progression made them a prime target of research. Presently, limited information is available on HSPs in reference to their identification and structural classification across genera. To that extent, HSPIR provides manually curated information on sequence, structure, classification, ontology, domain organization, localization and possible biological functions extracted from UniProt, GenBank, Protein Data Bank and the literature. The database offers interactive search with incorporated tools, which enhances the analysis. HSPIR is a reliable resource for researchers exploring structure, function and evolution of HSPs.

Journal ArticleDOI
TL;DR: Investigation of the utility of the ITS rDNA locus for identifying Morchella species, using phylogenetic species previously inferred from multilocus DNA sequence data as a reference, and the need for a dedicated Web-accessible reference database to facilitate the rapid identification of known and novel species.
Abstract: Arguably more mycophiles hunt true morels (Morchella) during their brief fruiting season each spring in the northern hemisphere than any other wild edible fungus. Concerns about overharvesting by individual collectors and commercial enterprises make it essential that science-based management practices and conservation policies are developed to ensure the sustainability of commercial harvests and to protect and preserve morel species diversity. Therefore, the primary objectives of the present study were to: (i) investigate the utility of the ITS rDNA locus for identifying Morchella species, using phylogenetic species previously inferred from multilocus DNA sequence data as a reference; and (ii) clarify insufficiently identified sequences and determine whether the named sequences in GenBank were identified correctly. To this end, we generated 553 Morchella ITS rDNA sequences and downloaded 312 additional ones generated by other researchers from GenBank using emerencia and analyzed them phylogenetically. Three major findings emerged: (i) ITS rDNA sequences were useful in identifying 48/62 (77.4%) of the known phylospecies; however, they failed to identify 12 of the 22 species within the species-rich Elata Subclade and two closely related species in the Esculenta Clade; (ii) at least 66% of the named Morchella sequences in GenBank are misidentified; and (iii) ITS rDNA sequences of up to six putatively novel Morchella species were represented in GenBank. Recognizing the need for a dedicated Web-accessible reference database to facilitate the rapid identification of known and novel species, we constructed Morchella MLST (http://www.cbs.knaw.nl/morchella/), which can be queried with ITS rDNA sequences and those of the four other genes used in our prior multilocus molecular systematic studies of this charismatic genus.

Journal ArticleDOI
TL;DR: This workbench simplifies first phylogenetic analyses to only a few mouse-clicks, while additionally providing tools and data for comprehensive large-scale analyses.
Abstract: The internal transcribed spacer 2 (ITS2) has been used as a phylogenetic marker for more than two decades. As ITS2 research mainly focused on the very variable ITS2 sequence, it confined this marker to low-level phylogenetics only. However, the combination of the ITS2 sequence and its highly conserved secondary structure improves the phylogenetic resolution and allows phylogenetic inference at multiple taxonomic ranks, including species delimitation. The ITS2 Database presents an exhaustive dataset of internal transcribed spacer 2 sequences from NCBI GenBank, accurately reannotated.

Journal ArticleDOI
TL;DR: The predicted predominance of ovary gene expression and assignment of directly relevant Gene Ontology classes suggests a powerful utility of this dataset for analysis of ovarian gene expression related to fundamental questions of oogenesis.
Abstract: The striped bass and its relatives (genus Morone) are important fisheries and aquaculture species native to estuaries and rivers of the Atlantic coast and Gulf of Mexico in North America. To open avenues of gene expression research on reproduction and breeding of striped bass, we generated a collection of expressed sequence tags (ESTs) from a complementary DNA (cDNA) library representative of their ovarian transcriptome. Sequences of a total of 230,151 ESTs (51,259,448 bp) were acquired by Roche 454 pyrosequencing of cDNA pooled from ovarian tissues obtained at all stages of oocyte growth, at ovulation (eggs), and during preovulatory atresia. Quality filtering of ESTs allowed assembly of 11,208 high-quality contigs ≥ 100 bp, including 2,984 contigs 500 bp or longer (average length 895 bp). Blastx comparisons revealed 5,482 gene orthologues (E-value < 10^-3), of which 4,120 (36.7% of total contigs) were annotated with Gene Ontology terms (E-value < 10^-6). There were 5,726 remaining unknown unique sequences (51.1% of total contigs). All of the high-quality EST sequences are available in the National Center for Biotechnology Information (NCBI) Short Read Archive (GenBank: SRX007394). Informative contigs were considered to be abundant if they were assembled from groups of ESTs comprising ≥ 0.15% of the total short read sequences (≥ 345 reads/contig). Approximately 52.5% of these abundant contigs were predicted to have predominant ovary expression through digital differential display in silico comparisons to zebrafish (Danio rerio) UniGene orthologues. Over 1,300 Gene Ontology terms from Biological Process classes of Reproduction, Reproductive process, and Developmental process were assigned to this collection of annotated contigs. This first large reference sequence database available for the ecologically and economically important temperate basses (genus Morone) provides a foundation for gene expression studies in these species. The predicted predominance of ovary gene expression and assignment of directly relevant Gene Ontology classes suggests a powerful utility of this dataset for analysis of ovarian gene expression related to fundamental questions of oogenesis. Additionally, a high definition Agilent 60-mer oligo ovary 'UniClone' microarray with 8 × 15,000 probe format has been designed based on this striped bass transcriptome (eArray Group: Striper Group, Design ID: 029004).
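The read and contig counts quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration written for this listing, not part of the study's pipeline; the variable and function names are assumptions, and every number is taken from the abstract.

```python
import math

# Figures quoted in the abstract.
TOTAL_ESTS = 230_151          # reads obtained by 454 pyrosequencing
TOTAL_BASES = 51_259_448      # total bp across all ESTs
TOTAL_CONTIGS = 11_208        # high-quality contigs >= 100 bp
ORTHOLOGUE_CONTIGS = 5_482    # contigs with a Blastx hit
UNKNOWN_CONTIGS = 5_726       # contigs without a hit
ABUNDANCE_FRACTION = 0.0015   # "abundant" means >= 0.15% of all reads


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# 0.15% of all reads, expressed as a read count per contig.
abundance_threshold = TOTAL_ESTS * ABUNDANCE_FRACTION
abundance_threshold_reads = math.floor(abundance_threshold)

# Contigs with and without a Blastx hit should partition the assembly.
partition_is_complete = ORTHOLOGUE_CONTIGS + UNKNOWN_CONTIGS == TOTAL_CONTIGS

unknown_pct = percent(UNKNOWN_CONTIGS, TOTAL_CONTIGS)
mean_est_length = TOTAL_BASES / TOTAL_ESTS

if __name__ == "__main__":
    print("abundance threshold (reads/contig):", abundance_threshold_reads)
    print("hit + no-hit contigs cover the assembly:", partition_is_complete)
    print("unknown contigs (% of total):", unknown_pct)
    print("mean EST length (bp):", round(mean_est_length, 1))
```

The check confirms that the 345 reads/contig cutoff is simply 0.15% of the 230,151 reads, and that the orthologue and unknown contig counts together account for all 11,208 contigs.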

01 Jan 2012
TL;DR: HSPIR provides manually curated information on sequence, structure, classification, ontology, domain organization, localization and possible biological functions extracted from UniProt, GenBank, Protein Data Bank and the literature.
Abstract: Summary: Heat shock protein information resource (HSPIR) is a concerted database of six major heat shock proteins (HSPs), namely, Hsp70, Hsp40, Hsp60, Hsp90, Hsp100 and small HSP. The HSPs are essential for the survival of all living organisms, as they protect the conformations of proteins on exposure to various stress conditions. They are a highly conserved group of proteins involved in diverse physiological functions, including de novo folding, disaggregation and protein trafficking. Moreover, their critical role in the control of disease progression made them a prime target of research. Presently, limited information is available on HSPs in reference to their identification and structural classification across genera. To that extent, HSPIR provides manually curated information on sequence, structure, classification, ontology, domain organization, localization and possible biological functions extracted from UniProt, GenBank, Protein Data Bank and the literature. The database offers interactive search with incorporated tools, which enhances the analysis. HSPIR is a reliable resource for researchers exploring structure, function and evolution of HSPs. Availability: http://pdslab.biochem.iisc.ernet.in/hspir/ Contact: patrick@biochem.iisc.ernet.in Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This study established the first record of E. granulosus sensu stricto, G2 genotype in dogs from Western Iran and genetically characterized using DNA sequencing of the partial mitochondrial cytochrome c oxidase subunit 1 (cox1) and NADH dehydrogenase 1 (nad1).

Journal ArticleDOI
TL;DR: This is the first single primer pair to amplify the complete cyt b gene in a broad range of mammalian species, and this primer pair can be used for the addition of new cyt b gene sequences and to enhance data available on species represented in GenBank.
Abstract: Sequence-based species identification relies on the extent and integrity of sequence data available in online databases such as GenBank. When identifying species from a sample of unknown origin, partial DNA sequences obtained from the sample are aligned against existing sequences in databases. When the sequence from the matching species is not present in the database, high-scoring alignments with closely related sequences might produce unreliable results on species identity. For species identification in mammals, the cytochrome b (cyt b) gene has been identified to be highly informative; thus, large amounts of reference sequence data from the cyt b gene are much needed. To enhance availability of cyt b gene sequence data on a large number of mammalian species in GenBank and other such publicly accessible online databases, we identified a primer pair for complete cyt b gene sequencing in mammals. Using this primer pair, we successfully PCR amplified and sequenced the complete cyt b gene from 40 of 44 mammalian species representing 10 orders of mammals. We submitted 40 complete, correctly annotated, cyt b protein coding sequences to GenBank. To our knowledge, this is the first single primer pair to amplify the complete cyt b gene in a broad range of mammalian species. This primer pair can be used for the addition of new cyt b gene sequences and to enhance data available on species represented in GenBank. The availability of novel and complete gene sequences as high-quality reference data can improve the reliability of sequence-based species identification.

Journal ArticleDOI
TL;DR: The data suggest that selective sweeps by base collection mechanisms more frequently eliminate polymorphisms in the IR region than in other regions, and that cp genome regions that have high levels of base substitutions also show higher incidences of indels.
Abstract: This study reports the complete chloroplast (cp) DNA sequence of Eleutherococcus senticosus (GenBank: JN637765), an endangered endemic species. The genome is 156,768 bp in length, and contains a pair of inverted repeat (IR) regions of 25,930 bp each, a large single copy (LSC) region of 86,755 bp and a small single copy (SSC) region of 18,153 bp. The structural organization, gene and intron contents, gene order, AT content, codon usage, and transcription units of the E. senticosus chloroplast genome are similar to those of typical land plant cp DNA. We aligned and analyzed the sequences of 86 coding genes, 19 introns and 113 intergenic spacers (IGS) in three different taxonomic hierarchies: Eleutherococcus vs. Panax, Eleutherococcus vs. Daucus, and Eleutherococcus vs. Nicotiana. The distribution of indels, the number of polymorphic sites and nucleotide diversity indicate that positional constraint is more important than functional constraint for the evolution of cp genome sequences in Asterids. For example, the intron sequences in the LSC region exhibited base substitution rates 5 to 11 times higher than those in the IR regions, while the intron sequences in the SSC region evolved 7 to 14 times faster than those in the IR region. Furthermore, the Ka/Ks ratio of the gene coding sequences supports a stronger evolutionary constraint in the IR region than in the LSC or SSC regions. Therefore, our data suggest that selective sweeps by base collection mechanisms more frequently eliminate polymorphisms in the IR region than in other regions. Chloroplast genome regions that have high levels of base substitutions also show higher incidences of indels. Thirty-five simple sequence repeat (SSR) loci were identified in the Eleutherococcus chloroplast genome. Of these, 27 are homopolymers, while six are di-polymers and two are tri-polymers. In addition to the SSR loci, we also identified 18 medium size repeat units ranging from 22 to 79 bp, 11 of which are distributed in the IGS or intron regions. These medium size repeats may contribute to developing a cp genome-specific gene introduction vector because these regions may serve as specific recombination sites.
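The region sizes reported for this chloroplast genome follow the usual quadripartite layout, in which the total length is the LSC plus the SSC plus two copies of the IR. The short sketch below checks that the published figures are mutually consistent; it is an illustrative calculation with assumed variable names, not code from the study.

```python
# Region lengths (bp) as reported in the abstract.
LSC = 86_755               # large single copy region
SSC = 18_153               # small single copy region
IR = 25_930                # one inverted repeat copy (the genome carries two)
REPORTED_TOTAL = 156_768   # reported genome length

# Quadripartite structure: LSC + IR + SSC + IR.
computed_total = LSC + SSC + 2 * IR

# Share of the genome occupied by both IR copies, in percent.
ir_share_pct = round(100.0 * (2 * IR) / computed_total, 1)

# SSR loci by repeat unit size, as reported.
ssr_counts = {"homopolymer": 27, "di-polymer": 6, "tri-polymer": 2}
total_ssr = sum(ssr_counts.values())

if __name__ == "__main__":
    print("computed genome length:", computed_total)
    print("matches reported length:", computed_total == REPORTED_TOTAL)
    print("IR share of genome (%):", ir_share_pct)
    print("total SSR loci:", total_ssr)
```

The sum reproduces the reported 156,768 bp, and the three SSR classes add up to the 35 loci stated in the abstract.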

Journal ArticleDOI
TL;DR: Transgenic analysis in tobacco revealed that up-regulation of TaZFP15 could significantly improve plant dry mass accumulation by increasing the plant's phosphorus acquisition capacity under Pi-deficient conditions, and suggested that TaZFP15 is involved in mediating signal transduction of diverse external stresses.

Journal ArticleDOI
TL;DR: A test of a worldwide collection of whiteflies demonstrates that this combination mtCOI PCR-RFLP technique can reliably distinguish not only the MED from the Middle East-Asia Minor 1 group but also the Q1 from any of the other four MED subclades.
Abstract: The Mediterranean group (commonly known as Q biotype; hereafter MED) of the sweetpotato whitefly, Bemisia tabaci (Gennadius), originated in the Mediterranean region, but it now has been found in at least 10 countries outside the Mediterranean. Collections of B. tabaci from some of these countries exhibit different pest behaviors and pesticide resistance characteristics, yet all may be classified as MED. A phylogenetic analysis of 120 mitochondrial cytochrome oxidase I (mtCOI) sequences (JN966761–JN966880) of MED whiteflies collected in Arizona and of 417 retrieved from the GenBank database resolves the MED into five subclades, designated as Q1–Q5. Only subclades Q1 and Q2 have been detected in the United States. Q1 and the other four subclades (Q2–Q5) differ in the number or position of the AluI recognition sites. Based on the differences in the AluI recognition sites reported here and the previously reported differences in VspI recognition sites, we developed a simple diagnostic technique to ide...
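A PCR-RFLP diagnostic of the kind described here rests on a simple idea: haplotypes that differ in the number or position of a restriction site yield different fragment patterns after digestion. The sketch below illustrates that idea for AluI, which recognizes AGCT and cuts between G and C to leave blunt ends. The two sequences are invented toy strings, not real mtCOI haplotypes, and the function names are assumptions made for this example.

```python
ALUI_SITE = "AGCT"      # AluI recognition sequence (palindromic)
ALUI_CUT_OFFSET = 2     # AluI cuts AG^CT, i.e. 2 bases into the site


def find_sites(seq: str, site: str = ALUI_SITE) -> list[int]:
    """Return 0-based start positions of every occurrence of the site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq: str, site: str = ALUI_SITE,
           cut_offset: int = ALUI_CUT_OFFSET) -> list[int]:
    """Return fragment lengths, in order, after cutting a linear sequence."""
    cuts = [pos + cut_offset for pos in find_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]


# Hypothetical amplicons that differ by one base in the second site.
TOY_HAPLOTYPE_A = "GGATAGCTTTACCGGAGCTAAT"   # two AluI sites
TOY_HAPLOTYPE_B = "GGATAGCTTTACCGGAGTTAAT"   # second site lost (C -> T)

if __name__ == "__main__":
    for name, seq in [("A", TOY_HAPLOTYPE_A), ("B", TOY_HAPLOTYPE_B)]:
        print(name, "sites at", find_sites(seq), "fragments", digest(seq))
```

Because the recognition sequence is palindromic, scanning one strand is enough. A single substitution removes a site and merges two fragments into one, which is the kind of band difference an RFLP gel would show.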

Journal ArticleDOI
TL;DR: In this article, genomic sequences of isolates that belong to different genotypes of the infectious laryngotracheitis (ILT) virus were determined and compared using dideoxy sequencing, 454 pyrosequencing and Illumina sequencing-by-synthesis.
Abstract: Gallid herpesvirus-1 (GaHV-1), commonly named infectious laryngotracheitis (ILT) virus, causes the respiratory disease in chickens known as ILT. The molecular determinants associated with differences in pathogenicity of GaHV-1 strains are not completely understood, and a comparison of genomic sequences of isolates that belong to different genotypes could help identify genes involved in virulence. Dideoxy sequencing, 454 pyrosequencing and Illumina sequencing-by-synthesis were used to determine the nucleotide sequences of four genotypes of virulent strains from GaHV-1 groups I–VI. Three hundred and twenty-five open reading frames (ORFs) were compared with those of the recently sequenced genome of the Serva vaccine strain. Only four ORFs, ORF C, UL37, ICP4 and US2, differed in amino acid (aa) lengths among the newly sequenced genomes. Genome sequence alignments were used to identify two regions (5′ terminus and the unique short/repeat short junction) that contained deletions. Seventy-eight synonymous and 118 non-synonymous amino acid substitutions were identified with the examined ORFs. Exclusive to the genome of the Serva vaccine strain, seven non-synonymous mutations were identified in the predicted translation products of the genes encoding glycoproteins gB, gE, gL and gM and three non-structural proteins UL28 (DNA packaging protein), UL5 (helicase-primase) and the immediate early protein ICP4. Furthermore, our comparative sequence analysis of published and newly sequenced GaHV-1 isolates has provided evidence placing the cleavage/packaging site (a-like sequence) within the inverted repeats instead of its placement at the 3′ end of the UL region as annotated in GenBank entries NC006623 and HQ630064.

Journal ArticleDOI
TL;DR: DNA sequence homology at the 5'- and 3'-ends of the various structural genes indicated that non-specific priming may allow PCR amplification of heterologous bacteriocin genes.

Journal ArticleDOI
TL;DR: VIGOR has been extended to predict genes for 12 viruses, including measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus and Venezuelan equine encephalitis virus.
Abstract: A gene prediction program, VIGOR (Viral Genome ORF Reader), was developed at J. Craig Venter Institute in 2010 and has been successfully performing gene calling in coronavirus, influenza, rhinovirus and rotavirus for projects at the Genome Sequencing Center for Infectious Diseases. VIGOR uses sequence similarity search against custom protein databases to identify protein coding regions, start and stop codons and other gene features. Ribonucleic acid editing and other features are accurately identified based on sequence similarity and signature residues. VIGOR produces four output files: a gene prediction file, a complementary DNA file, an alignment file, and a gene feature table file. The gene feature table can be used to create a GenBank submission. VIGOR takes a single input: viral genomic sequences in FASTA format. VIGOR has been extended to predict genes for 12 viruses: measles virus, mumps virus, rubella virus, respiratory syncytial virus, alphavirus and Venezuelan equine encephalitis virus, norovirus, metapneumovirus, yellow fever virus, Japanese encephalitis virus, parainfluenza virus and Sendai virus. VIGOR accurately detects complex gene features such as ribonucleic acid editing, stop codon leakage and ribosomal shunting. Precisely identifying the mat_peptide cleavage for some viruses is a built-in feature of VIGOR. The gene predictions for these viruses have been evaluated by testing from 27 to 240 genomes from GenBank.

Journal ArticleDOI
TL;DR: The results obtained demonstrate the feasibility of complete ITS1 sequences in C. sinensis population genetics and can be considered as a basis for further studies of the parasite infection because they may help to elucidate the molecular mechanisms of pathogen evolution and adaptation.

Journal ArticleDOI
TL;DR: The discovery of two novel picornaviruses in farm animals, cattle and sheep in Hungary, raises the possibility that hungaroviruses, human parechoviruses, and porcine teschoviruses may be linked to each other by modular recombination of functional noncoding RNA elements.
Abstract: Two novel picornaviruses were serendipitously identified in apparently healthy young domestic animals, first cattle (Bos taurus) and subsequently sheep (Ovis aries), in Hungary during 2008 and 2009. Complete genome sequencing and comparative analysis showed that the two viruses are related to each other and have identical genome organizations, VPg+5′UTR^IRES-II[L/1A-1B-1C-1D-2A^NPG↓P/2B-2C/3A-3B^VPg-3C^pro-3D^pol]3′UTR-poly(A). We suggest that they form two novel viral genotypes/serotypes, bovine hungarovirus 1 (BHuV-1; GenBank accession number JQ941880) and ovine hungarovirus 1 (OHuV-1; GenBank accession number HM153767), which may belong to a potential novel picornavirus genus in the family Picornaviridae. The genome lengths of BHuV-1 and OHuV-1 are 7,583 and 7,588 nucleotides, each comprising a single open reading frame encoding 2,243 and 2,252 amino acids, respectively. In the 5′ untranslated regions (5′ UTRs), both hungaroviruses are predicted to have a type II internal ribosome entry site (IRES). The nucleotide sequence and the secondary RNA structure of the hungarovirus IRES core domains H-I-J-K-L are highly similar to those of human parechovirus (HPeV) (genus Parechovirus), especially HPeV-3. However, in the polyprotein coding region, the amino acid sequences are more closely related to those of porcine teschoviruses (genus Teschovirus). Hungaroviruses were detected in 15% (4/26) and 25% (4/16) of the fecal samples from cattle and sheep, respectively. This report describes the discovery of two novel picornaviruses in farm animals, cattle and sheep. The mosaic genetic pattern raises the possibility that hungaroviruses, human parechoviruses, and porcine teschoviruses may be linked to each other by modular recombination of functional noncoding RNA elements.