
Showing papers on "GenBank published in 2014"


Journal Article
TL;DR: A concise BLAST database with a Bio-Edit interface is created that can detect antibiotic resistance (AR) genetic determinants in bacterial genomes and can rapidly and easily discover putative new AR genetic determinants.
Abstract: ARG-ANNOT (Antibiotic Resistance Gene-ANNOTation) is a new bioinformatic tool that was created to detect existing and putative new antibiotic resistance (AR) genes in bacterial genomes. ARG-ANNOT uses a local BLAST program in Bio-Edit software that allows the user to analyze sequences without a Web interface. All AR genetic determinants were collected from published works and online resources; nucleotide and protein sequences were retrieved from the NCBI GenBank database. After building a database that includes 1,689 antibiotic resistance genes, the software was tested in a blind manner using 100 random sequences selected from the database to verify that the sensitivity and specificity were at 100% even when partial sequences were queried. Notably, BLAST analysis results obtained using the rmtF gene sequence (a new aminoglycoside-modifying enzyme gene sequence that is not included in the database) as a query revealed that the tool was able to link this sequence to short sequences (17 to 40 bp) found in other genes of the rmt family with significant E values. Finally, the analysis of 178 Acinetobacter baumannii and 20 Staphylococcus aureus genomes allowed the detection of a significantly higher number of AR genes than the Resfinder gene analyzer and 11 point mutations in target genes known to be associated with AR. The average time for the analysis of a genome was 3.35 ± 0.13 min. We have created a concise database for BLAST using a Bio-Edit interface that can detect AR genetic determinants in bacterial genomes and can rapidly and easily discover putative new AR genetic determinants.

1,016 citations


Journal Article
TL;DR: This work describes a further improved and refined version of the M. truncatula genome (Mt4.0) based on de novo whole genome shotgun assembly of a majority of Illumina and 454 reads using ALLPATHS-LG, and re-annotates the genome through the gene prediction pipeline, which integrates EST, RNA-seq, protein and gene prediction evidences.
Abstract: Medicago truncatula, a close relative of alfalfa, is a preeminent model for studying nitrogen fixation, symbiosis, and legume genomics. The Medicago sequencing project began in 2003 with the goal to decipher sequences originated from the euchromatic portion of the genome. The initial sequencing approach was based on a BAC tiling path, culminating in a BAC-based assembly (Mt3.5) as well as an in-depth analysis of the genome published in 2011. Here we describe a further improved and refined version of the M. truncatula genome (Mt4.0) based on de novo whole genome shotgun assembly of a majority of Illumina and 454 reads using ALLPATHS-LG. The ALLPATHS-LG scaffolds were anchored onto the pseudomolecules on the basis of alignments to both the optical map and the genotyping-by-sequencing (GBS) map. The Mt4.0 pseudomolecules encompass ~360 Mb of actual sequences spanning 390 Mb of which ~330 Mb align perfectly with the optical map, presenting a drastic improvement over the BAC-based Mt3.5 which only contained 70% sequences (~250 Mb) of the current version. Most of the sequences and genes that previously resided on the unanchored portion of Mt3.5 have now been incorporated into the Mt4.0 pseudomolecules, with the exception of ~28 Mb of unplaced sequences. With regard to gene annotation, the genome has been re-annotated through our gene prediction pipeline, which integrates EST, RNA-seq, protein and gene prediction evidences. A total of 50,894 genes (31,661 high confidence and 19,233 low confidence) are included in Mt4.0 which overlapped with ~82% of the gene loci annotated in Mt3.5. Of the remaining genes, 14% of the Mt3.5 genes have been deprecated to an “unsupported” status and 4% are absent from the Mt4.0 predictions. Mt4.0 and its associated resources, such as genome browsers, BLAST-able datasets and gene information pages, can be found on the JCVI Medicago web site ( http://www.jcvi.org/medicago ). 
The assembly and annotation has been deposited in GenBank (BioProject: PRJNA10791). The heavily curated chromosomal sequences and associated gene models of Medicago will serve as a better reference for legume biology and comparative genomics.

373 citations


Journal Article
31 Dec 2014-Mbio
TL;DR: This study confirmed known misidentifications, validated the recent revisions in the nomenclature, and revealed that a number of genomes deposited in GenBank are misnamed.
Abstract: Prokaryotic taxonomy is the underpinning of microbiology, as it provides a framework for the proper identification and naming of organisms. The “gold standard” of bacterial species delineation is the overall genome similarity determined by DNA-DNA hybridization (DDH), a technically rigorous yet sometimes variable method that may produce inconsistent results. Improvements in next-generation sequencing have resulted in an upsurge of bacterial genome sequences and bioinformatic tools that compare genomic data, such as average nucleotide identity (ANI), correlation of tetranucleotide frequencies, and the genome-to-genome distance calculator, or in silico DDH (isDDH). Here, we evaluate ANI and isDDH in combination with phylogenetic studies using Aeromonas, a taxonomically challenging genus with many described species and several strains that were reassigned to different species as a test case. We generated improved, high-quality draft genome sequences for 33 Aeromonas strains and combined them with 23 publicly available genomes. ANI and isDDH distances were determined and compared to phylogenies from multilocus sequence analysis of housekeeping genes, ribosomal proteins, and expanded core genes. The expanded core phylogenetic analysis suggested relationships between distant Aeromonas clades that were inconsistent with studies using fewer genes. ANI values of ≥96% and isDDH values of ≥70% consistently grouped genomes originating from strains of the same species together. Our study confirmed known misidentifications, validated the recent revisions in the nomenclature, and revealed that a number of genomes deposited in GenBank are misnamed. In addition, two strains were identified that may represent novel Aeromonas species. IMPORTANCE Improvements in DNA sequencing technologies have resulted in the ability to generate large numbers of high-quality draft genomes and led to a dramatic increase in the number of publically available genomes. 
This has allowed researchers to characterize microorganisms using genome data. Advantages of genome sequence-based classification include data and computing programs that can be readily shared, facilitating the standardization of taxonomic methodology and resolving conflicting identifications by providing greater uniformity in an overall analysis. Using Aeromonas as a test case, we compared and validated different approaches. Based on our analyses, we recommend cutoff values for distance measures for identifying species. Accurate species classification is critical not only to obviate the perpetuation of errors in public databases but also to ensure the validity of inferences made on the relationships among species within a genus and proper identification in clinical and veterinary diagnostic laboratories.
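As a rough illustration of how the recommended cutoffs could be applied, the sketch below flags a genome pair as belonging to the same species when both reported thresholds are met (ANI ≥96% and isDDH ≥70%). Requiring both measures at once is an assumption made here for illustration, as are the function name and the example values; the abstract does not state how the two measures are combined.

```python
# Cutoffs reported in the abstract above.
ANI_CUTOFF = 96.0    # average nucleotide identity, percent
ISDDH_CUTOFF = 70.0  # in silico DNA-DNA hybridization, percent


def same_species(ani, isddh):
    """Return True when a genome pair meets both cutoffs.

    Combining the two measures with a logical AND is an assumption
    made for this sketch, not a rule stated in the abstract.
    """
    return ani >= ANI_CUTOFF and isddh >= ISDDH_CUTOFF


# Hypothetical values, for illustration only.
print(same_species(97.2, 75.0))  # both cutoffs met
print(same_species(95.9, 75.0))  # ANI below the cutoff
```

In practice a pair that passes one cutoff but not the other would probably warrant a closer look at the phylogeny rather than an automatic call.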

183 citations


Journal Article
TL;DR: Analysis of gene, repetitive element, and segmental duplication content shows this assembly to be of excellent quality and contiguity; however, comparison to assembly-independent resources, such as BAC clone end sequences and PacBio long reads, indicates misassembled regions.
Abstract: The production of a reference sequence assembly for the human genome was a milestone in biology and clearly has impacted many areas of biomedical research (McPherson et al. 2001; International Human Genome Sequencing 2004). The availability of this resource allows us to investigate genomic structure and variation at a depth previously unavailable (Kidd et al. 2008; The 1000 Genomes Project Consortium 2012). These studies have helped make clear the shortcomings of our initial assembly models and the difficulty of comprehensive genome analysis. While the current human reference assembly is of extremely high quality and is still the benchmark by which all other human assemblies must be compared, it is far from perfect. Technical and biological complexity lead to both missing sequences as well as misassembled sequence in the current reference, GRCh38 (Robledo et al. 2002; Eichler et al. 2004; International Human Genome Sequencing 2004; Church et al. 2011; Genovese et al. 2013). The two most vexing biological problems affecting assembly are (1) complex genomic architecture seen in large regions with highly homologous duplicated sequences and (2) excess allelic diversity (Bailey et al. 2001; Mills et al. 2006; Korbel et al. 2007; Kidd et al. 2008; Zody et al. 2008). Assembling these regions is further complicated due to the fact that regions of segmental duplication (SD) are often correlated with copy-number variants (CNVs) (Sharp et al. 2005). Regions harboring large CNV SDs have been misrepresented in the reference assembly because assembly algorithms aim to produce a haploid consensus. Highly identical paralogous and structurally polymorphic regions frequently lead to nonallelic sequences being collapsed into a single contig or allelic sequences being improperly represented as duplicates. Because of this complexity, a single, haploid reference is insufficient to fully represent human diversity (Church et al. 2011). 
The availability of at least one accurate allelic representation at loci with complex genomic architecture facilitates the understanding of the genomic architecture and diversity in these regions (Watson et al. 2013). To enable the assembly of these regions, we have developed a suite of resources from CHM1, a DNA source containing a single human haplotype (Taillon-Miller et al. 1997; Fan et al. 2002). A complete hydatidiform mole (CHM) is an abnormal product of conception in which there is a very early fetal demise and overgrowth of the placental tissue. Most CHMs are androgenetic and contain only paternally derived autosomes and sex chromosomes resulting either from dispermy or duplication of a single sperm genome. The phenotype is thought to be a result of abnormal parental contribution leading to aberrant genomic imprinting (Hoffner and Surti 2012). The absence of allelic variation in monospermic CHM makes it an ideal candidate for producing a single haplotype representation of the human genome. There are a number of existing resources associated with the “CHM1” sample, including a BAC library with end sequences generated with Sanger sequencing using ABI 3730 technology (https://bacpac.chori.org/), an optical map (Teague et al. 2010), and a BioNano genomic map (see Data access), some of which have previously been used to improve regions of the reference human genome assembly. BAC clones have historically been used to resolve difficult genomic regions and identify structural variants (Barbouti et al. 2004; Carvalho and Lupski 2008). A BAC library constructed from CHM1 DNA (CHORI-17, CH17) has also been utilized to resolve several very difficult genomic regions, including human-specific duplications at the SRGAP2 gene family on Chromosome 1 (Dennis et al. 2012). Additionally, the CHM1 BAC clones were used to generate single haplotype assemblies of regions that were previously misrepresented because of haplotype mixing (Watson et al. 2013). 
Both of these efforts contributed to the improvement of the GRCh38 reference human genome assembly, adding hundreds of kilobases of sequence missing in GRCh37, in addition to providing an accurate single haplotype representation of complex genome regions. Because of the previously established utility of sequence data derived from the CHM1 resource, we wished to develop a complete assembly of a single human haplotype. To this end, we produced a short read-based (Illumina) reference-guided assembly of CHM1 with integrated high-quality finished fully sequenced BAC clones to further improve the assembly. This assembly has been annotated using the NCBI annotation process and has been aligned to other human assemblies in GenBank, including both GRCh37 and GRCh38. Here we present evidence that the CHM1 genome assembly is a high-quality draft with respect to gene and repetitive element content as well as by comparison to other individual genome assemblies. We will also discuss current plans for developing a fully finished genome assembly based on this resource.

132 citations


Journal Article
TL;DR: This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought.
Abstract: 5S Ribosomal RNA (5S rRNA) is a universal component of ribosomes, and the corresponding gene is easily identified in archaeal, bacterial and nuclear genome sequences. However, organelle gene homologs (rrn5) appear to be absent from most mitochondrial and several chloroplast genomes. Here, we re-examine the distribution of organelle rrn5 by building mitochondrion- and plastid-specific covariance models (CMs) with which we screened organelle genome sequences. We not only recover all organelle rrn5 genes annotated in GenBank records, but also identify more than 50 previously unrecognized homologs in mitochondrial genomes of various stramenopiles, red algae, cryptomonads, malawimonads and apusozoans, and surprisingly, in the apicoplast (highly derived plastid) genomes of the coccidian pathogens Toxoplasma gondii and Eimeria tenella. Comparative modeling of RNA secondary structure reveals that mitochondrial 5S rRNAs from brown algae adopt a permuted triskelion shape that has not been seen elsewhere. Expression of the newly predicted rrn5 genes is confirmed experimentally in 10 instances, based on our own and published RNA-Seq data. This study establishes that particularly mitochondrial 5S rRNA has a much broader taxonomic distribution and a much larger structural variability than previously thought. The newly developed CMs will be made available via the Rfam database and the MFannot organelle genome annotator.

100 citations


Journal Article
10 Apr 2014-PLOS ONE
TL;DR: The transcriptome provides invaluable new data as a functional genomics resource for future biological research in Portunus trituberculatus, and will also inform future functional studies to manipulate or select for genes influencing growth, which should find practical applications in aquaculture breeding programs.
Abstract: Background: The swimming crab, Portunus trituberculatus, is an important farmed species in China and has attracted extensive study, which increasingly requires genomic background knowledge. To date, no whole-genome sequence is available and transcriptomic information is also scarce for this species. In the present study, we performed de novo transcriptome sequencing to produce a comprehensive transcript dataset for major tissues of Portunus trituberculatus using Illumina paired-end sequencing technology. Results: Total RNA was isolated from eyestalk, gill, heart, hepatopancreas and muscle. Equal quantities of RNA from each tissue were pooled to construct a cDNA library. Using Illumina paired-end sequencing technology, we generated a total of 120,137 transcripts with an average length of 1037 bp. Further assembly analysis showed that all contigs contributed to 87,100 unigenes; of these, 16,029 unigenes (18.40% of the total) could be matched in the GenBank non-redundant database. Potential genes and their functions were predicted by GO, KEGG pathway mapping and COG analysis. Based on our sequence analysis and published literature, many putative genes with fundamental roles in growth and muscle development, including actin, myosin, tropomyosin, troponin and other potentially important candidate genes, were identified for the first time in this species. Furthermore, 22,673 SSRs and 66,191 high-confidence SNPs were identified in this EST dataset. Conclusion: The transcriptome provides invaluable new data as a functional genomics resource for future biological research in Portunus trituberculatus. The data will also inform future functional studies to manipulate or select for genes influencing growth, which should find practical applications in aquaculture breeding programs. 
The molecular markers identified in this study will provide a material basis for future genetic linkage and quantitative trait loci analyses, and will be essential for accelerating aquaculture breeding programs with this species.

75 citations


Journal Article
Cancan Cheng, Jingjing Sun, Fen Zheng, Kuihai Wu, Yongyu Rui
TL;DR: The results show the importance of 16S rDNA and internal ITS2 sequencing for the molecular identification of “difficult-to-identify” bacteria and fungi and will make it worth extending into clinical practice in developing countries.
Abstract: Clinical microbiology laboratories have to accurately identify clinical microbes. However, there can be failure in the identification of certain bacteria or fungi by phenotypic criteria sometimes. Therefore, the ability of 16S ribosomal DNA (16S rDNA) and internal transcribed spacer 2 (ITS2) sequencing to identify these “difficult-to-identify” bacteria and fungi was assessed in this study. Samples obtained from a teaching hospital over the past three years were examined. The 16S rDNA of four standard strains, 18 clinical common isolates, and 47 “difficult-to-identify” clinical bacteria were amplified by PCR and sequenced. The ITS2 of eight standard strains and 31 “difficult-to-identify” clinical fungi were also amplified by PCR and sequenced. The sequences of 16S rDNA and ITS2 were compared to reference data available in GenBank by using the BLASTN program. These microbes were identified according to the percentage of similarity to reference sequences of strains in GenBank. The results from molecular sequencing methods correlated well with automated microbiological identification systems for common clinical isolates. Sequencing results of the standard strains were consistent with their known phenotype. Overall, 47 “difficult-to-identify” clinical bacteria were identified as 35 genera or species by sequence analysis (with 10 of these identified isolates first reported in clinical specimens in China and two first identified in the international literature). 31 “difficult-to-identify” clinical fungi tested could be identified as 15 genera or species by sequence analysis (with two of these first reported in China). Our results show the importance of 16S rDNA and internal ITS2 sequencing for the molecular identification of “difficult-to-identify” bacteria and fungi. The development of this method with advantages of convenience, availability, and cost-effectiveness will make it worth extending into clinical practice in developing countries.

74 citations


Journal Article
TL;DR: GenBank represents the main source of errors for identifying Saprolegnia species since it possesses sequences with misassigned names and also sequencing errors, which might help setting the basis for a suitable identification of species in this economically important genus.

59 citations


Journal Article
TL;DR: The genus Streptomyces is now one of the most highly sequenced, with 19 finished genomic sequences and a further 125 draft assemblies available in the GenBank database as of 3rd of May 2014; by the time this is published, no doubt there will be more.
Abstract: Many readers of this journal will need no introduction to the bacterial genus Streptomyces, which includes several hundred species, many of which produce biotechnologically useful secondary metabolites. The last 2 years have seen numerous publications describing Streptomyces genome sequences (Table 1), mostly as short genome announcements restricted to just 500 words and therefore allowing little description and analysis. Our aim in this current manuscript is to survey these recent publications and to dig a little deeper where appropriate. The genus Streptomyces is now one of the most highly sequenced, with 19 finished genomic sequences (Table 2) and a further 125 draft assemblies available in the GenBank database as of 3rd of May 2014; by the time this is published, no doubt there will be more. The reasons given for sequencing this latest crop of Streptomyces include production of industrially important enzymes, degradation of lignin, antibiotic production, rapid growth and halo-tolerance and an endophytic lifestyle (Table 1).
[Table 1: Recent genome publications (2013 and 2014) for Streptomyces species]
[Table 2: Completely sequenced Streptomyces species genome sequences available in GenBank as of 29 April 2014]
Mining genomes for secondary metabolism gene clusters
Given the strong emphasis on secondary metabolism in Streptomyces genomics research, it is timely that version 2.0 of antiSMASH has been released and published (Blin et al., 2013). This computational tool has become a de facto standard for mining secondary metabolism gene clusters in genome sequences. Version 2.0 is completely revamped and, significantly, can now be used with highly fragmented draft-quality genome sequences whereas the previous version only worked well with finished genomes. Clearly, this is of immense importance to the discovery of novel metabolites in the ever-expanding database of streptomycete draft-quality genome sequences. 
For example, antiSMASH 2.0 analysis of the Streptomyces roseochromogenes subsp. oscitans DS 12.976 genome sequence revealed 43 new gene clusters in addition to recovering the already known clorobiocin gene cluster (Ruckert et al., 2014). The genome sequence of Streptomyces gancidicus strain BKS 13–15 was published before antiSMASH 2.0 became available. The authors state that seven genes mapped on to the streptomycin biosynthesis pathway based on gene-by-gene sequence similarities (Kumar et al., 2013) against homologues of genes in KEGG pathways (Kanehisa et al., 2012). However, we found no bioinformatic evidence for a streptomycin biosynthesis pathway encoded in this genome, although our antiSMASH 2.0 search did find 38 putative gene clusters. In common with many other pathways for secondary metabolism, genes for production of the aminoglycoside streptomycin are organized into a cluster of contiguous genes. The nucleotide sequences of at least two such clusters are available (GenBank accessions GU384160 and AJ862840 from Streptomyces platensis and Streptomyces griseus respectively). Our blastn searches (using these two cluster sequences as queries) failed to detect a complete streptomycin gene cluster in the S. gancidicus genome, but there were some regions of sequence similarity on a 111 kb contig (GenBank: AOHP01000057). An antiSMASH 2.0 search failed to find any aminoglycoside biosynthetic cluster in this genome. 
We are not aware of any experimental evidence that this strain produces the aminoglycoside streptomycin and conclude that these seven genes highlighted by the authors (Kumar et al., 2013) most probably encode components of another, perhaps novel, pathway. This illustrates the value of the antiSMASH 2.0 tool, which has the potential to discover new pathways, rather than relying on similarity to the pathways already represented in the KEGG database (and therefore, by definition, not novel). The case of Streptomyces species strain Mg1 (Hoefler et al., 2013) illustrates another consideration when mining bacterial genome sequences for secondary metabolism gene clusters. Many of the recently published Streptomyces genome sequences are assembled from massively parallel sequencing platforms such as 454 GS-FLX and Illumina HiSeq. The short sequence reads (typically less than 450 bp) and relatively high error rates associated with these platforms can lead to rather fragmented and/or incomplete genome assemblies. The situation is not helped by the biased sequence composition (approximately 70% G + C) of Streptomyces DNA. Furthermore, non-ribosomal peptide synthetases (NRPS) and polyketide synthases (PKS) are long, modular proteins made up of many repeated domain units. This means that the genes encoding these key enzymes can be particularly difficult to assemble accurately from short sequence reads. To overcome this issue, the authors of the Mg1 genome project (Hoefler et al., 2013) exploited the PacBio SMRT sequencing technology, which provides sequence reads of several kb in length, meaning that an entire PKS or NRPS gene could be represented on a single sequence read, thus avoiding the difficulties of assembling repetitive sequence from short fragments. They also generated an assembly of the same genome based on 454 GS-FLX and Illumina HiSeq. 
The results were striking: more than 90% of the genome was represented in a single contig of 7.8 Mb in the PacBio-based assembly and the PacBio-based assembly was 19.9% longer than the 454/Illumina-based one (8 705 754 versus 7 260 368 bp). As the authors point out, this implies that more than 1 Mb of sequence in the PacBio-based assembly is missing from the 454/Illumina-based one, as can be seen in Fig. 1A. However, the 454/Illumina-based assembly is not simply a subset of the PacBio-based one; as illustrated in Fig. 1B, a substantial portion of the 454/Illumina-based assembly is missing from the PacBio assembly. Although it is by no means certain which assembly is more ‘correct’, it might be possible to generate a more complete genome assembly by reconciling the two different assemblies.
[Figure 1: Comparison of two different genome assemblies for Streptomyces strain Mg1, one based on PacBio sequence data and the other based on 454 and Illumina sequence data. A illustrates alignment of both the assemblies against the PacBio-based assembly. B illustrates ...]
Fragmentation and incompleteness of a genome assembly has implications for discovery of secondary metabolism gene clusters. In Fig. 1C, we show a putative NRPS gene cluster detected apparently intact in a single contig of the PacBio-based sequence assembly identified by antiSMASH 2.0. Searching the 454/Illumina-based assembly reveals two incomplete fragments of the gene cluster, lying on two different contigs, and with part of the cluster apparently absent. Although we should be cautious about extrapolating too much from this single anecdotal example, the evidence suggests that longer read lengths can be very valuable in genome mining for secondary metabolism clusters.
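The assembly-size comparison quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two assembly lengths given in the text; the variable names are illustrative.

```python
# Assembly lengths quoted in the abstract, in base pairs.
pacbio_bp = 8_705_754
short_read_bp = 7_260_368  # 454/Illumina-based assembly

# Net difference in assembly length.
extra_bp = pacbio_bp - short_read_bp

# Relative increase of the PacBio-based assembly over the 454/Illumina one.
relative_increase = 100 * extra_bp / short_read_bp

print(f"net difference: {extra_bp} bp")
print(f"relative increase: {relative_increase:.1f}%")
```

The net difference comes to 1,445,386 bp, about 19.9% of the shorter assembly, consistent with the quoted figures. Because part of the 454/Illumina-based assembly is itself missing from the PacBio-based one, the amount of PacBio sequence absent from the short-read assembly would be at least this net difference.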

57 citations


Journal Article
25 Nov 2014-PLOS ONE
TL;DR: Using consistent annotation across all sequenced bacterial species from GenBank and other sources via RAST, available from the PATRIC (Pathogenic Resource Integration Center) platform, the data for currently annotated reverse transcriptases from completely sequenced bacterial genomes are compiled.
Abstract: Much less is known about reverse transcriptases (RTs) in prokaryotes than in eukaryotes, with most prokaryotic enzymes still uncharacterized. Two surveys involving BLAST searches for RT genes in prokaryotic genomes revealed the presence of large numbers of diverse, uncharacterized RTs and RT-like sequences. Here, using consistent annotation across all sequenced bacterial species from GenBank and other sources via RAST, available from the PATRIC (Pathogenic Resource Integration Center) platform, we have compiled the data for currently annotated reverse transcriptases from completely sequenced bacterial genomes. RT sequences are broadly distributed across bacterial phyla, but green sulfur bacteria and cyanobacteria have the highest levels of RT sequence diversity (≤85% identity) per genome. By contrast, phylum Actinobacteria, for which a large number of genomes have been sequenced, was found to have a low RT sequence diversity. Phylogenetic analyses revealed that bacterial RTs could be classified into 17 main groups: group II introns, retrons/retron-like RTs, diversity-generating retroelements (DGRs), Abi-like RTs, CRISPR-Cas-associated RTs, group II-like RTs (G2L), and 11 other groups of RTs of unknown function. Proteobacteria had the highest potential functional diversity, as they possessed most of the RT groups. Group II introns and DGRs were the most widely distributed RTs in bacterial phyla. Our results provide insights into bacterial RT phylogeny and the basis for an update of annotation systems based on sequence/domain homology.

54 citations


Journal Article
TL;DR: ARBitrator rapidly updates a public nifH sequence database, and it is shown that it can be adapted for other genes, as well as used for screening with a best hit strategy to conserved domains.
Abstract: © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com. MOTIVATION: Studies of the biochemical functions and activities of uncultivated microorganisms in the environment require analysis of DNA sequences for phylogenetic characterization and for the development of sequence-based assays for the detection of microorganisms. The numbers of sequences for genes that are indicators of environmentally important functions such as nitrogen (N2) fixation have been rapidly growing over the past few decades. Obtaining these sequences from the National Center for Biotechnology Information's GenBank database is problematic because of annotation errors, nomenclature variation and paralogues; moreover, GenBank's structure and tools are not conducive to searching solely by function. For some genes, such as the nifH gene commonly used to assess community potential for N2 fixation, manual collection and curation are becoming intractable because of the large number of sequences in GenBank and the large number of highly similar paralogues. If analysis is to keep pace with sequence discovery, an automated retrieval and curation system is necessary. RESULTS: ARBitrator uses a two-step process composed of a broad collection of potential homologues followed by screening with a best hit strategy to conserved domains. 34 420 nifH sequences were identified in GenBank as of November 20, 2012. The false-positive rate is ∼0.033%. ARBitrator rapidly updates a public nifH sequence database, and we show that it can be adapted for other genes. AVAILABILITY AND IMPLEMENTATION: Java source and executable code are freely available to non-commercial users at http://pmc.ucsc.edu/∼wwwzehr/research/database/. CONTACT: zehrj@ucsc.edu SUPPLEMENTARY INFORMATION: SUPPLEMENTARY INFORMATION is available at Bioinformatics online.

Journal Article
TL;DR: Comparison analyses based on whole genomes support the opinion that the new isolates are relatively distantly related to any known subgroups of ALVs and might be classified as a new subgroup.
Abstract: The strain JS11C1, a member of a putative new subgroup of avian leukosis virus (ALV) that is different from all six known subgroups from chickens based on Gp85 amino acid sequence comparison, was isolated from Chinese native chicken breeds in 2012. In order to further study the genome structure, biological characteristics, and the evolutionary relationship of the virus with others of known subgroups from infected chickens, we determined the complete genome sequence, constructed an infectious clone of ALV strain JS11C1, and performed comparative analysis using the whole genome sequence or elements with that of other ALVs available in GenBank. The results showed that the full-length sequence of the JS11C1 DNA provirus genome was 7707 bp, which is consistent with a genetic organization typical of a replication-competent type C retrovirus lacking viral oncogenes. The rescued infectious clone of JS11C1 showed similar growth rate and biological characteristics to its original virus. All the comparison analyses based on whole genomes support the opinion that the new isolates are relatively distantly related to any known subgroups of ALVs and might be classified as a new subgroup.

Journal ArticleDOI
TL;DR: A well-developed, powerful, comparative computational approach, EST-based homology search, is applied to find potential miRNAs of coffee; for the first time, one potential miRNA from a large miRNA family with appropriate fold-back structures was identified through a series of filtration criteria.

Journal ArticleDOI
TL;DR: The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable and the scores highlighted organisms for which commonly used tools do not perform well.
Abstract: Background: More than 80% of the microbial genomes in GenBank are of ‘draft’ quality (12,553 draft vs. 2,679 finished, as of October, 2013). We have examined all the microbial DNA sequences available for complete, draft, and Sequence Read Archive genomes in GenBank as well as three other major public databases, and assigned quality scores for more than 30,000 prokaryotic genome sequences. Results: Scores were assigned using four categories: the completeness of the assembly, the presence of full-length rRNA genes, tRNA composition and the presence of a set of 102 conserved genes in prokaryotes. Most (~88%) of the genomes had quality scores of 0.8 or better and can be safely used for standard comparative genomics analysis. We compared genomes across factors that may influence the score. We found that although sequencing depth coverage of over 100x did not ensure a better score, sequencing read length was a better indicator of sequencing quality. With few exceptions, most of the 30,000 genomes have nearly all the 102 essential genes. Conclusions: The score can be used to set thresholds for screening data when analyzing “all published genomes” and reference data is either not available or not applicable. The scores highlighted organisms for which commonly used tools do not perform well. This information can be used to improve tools and to serve a broad group of users as more diverse organisms are sequenced. Unexpectedly, the comparison of predicted tRNAs across 15,000 high quality genomes showed that anticodons beginning with an ‘A’ (codons ending with a ‘U’) are almost non-existent, with the exception of one arginine codon (CGU); this has been noted previously in the literature for a few genomes, but not with the depth found here.
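To make the thresholding idea concrete, the sketch below combines four component scores into one value and screens it against the 0.8 cut-off mentioned in the abstract. The abstract does not say how the four categories are combined, so the unweighted mean used here is purely an assumption, as are the function names and the example component values. The last lines check the "more than 80% draft" figure from the quoted counts.

```python
from statistics import mean


def quality_score(completeness: float, rrna: float, trna: float, conserved: float) -> float:
    """Combine four component scores in [0, 1] (ASSUMED: unweighted mean)."""
    parts = (completeness, rrna, trna, conserved)
    if any(not 0.0 <= p <= 1.0 for p in parts):
        raise ValueError("each component score must lie in [0, 1]")
    return mean(parts)


def passes(score: float, threshold: float = 0.8) -> bool:
    """Screen a genome against a quality threshold (0.8 as in the abstract)."""
    return score >= threshold


# Hypothetical component scores for two genomes.
good = quality_score(0.95, 1.0, 0.9, 0.98)
poor = quality_score(0.5, 1.0, 0.6, 0.7)
print(passes(good), passes(poor))

# Draft share from the counts quoted in the abstract.
draft, finished = 12_553, 2_679
draft_fraction = draft / (draft + finished)
print(f"{100 * draft_fraction:.1f}% draft")
```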

Journal ArticleDOI
25 Mar 2014
TL;DR: Of the 39 species of ticks, 36 species (92.3%) were distinguishable by phylogenetic analysis of mt-rrs, suggesting that species identification of ticks based on mt-rrs is a viable alternative to morphological identification.
Abstract: Tick identification is important in control of tick-borne diseases because tick-borne pathogens are often transmitted by specific tick species. In this study, we determined partial DNA sequences of the mitochondrial 16S rDNA gene (mt-rrs) for ticks including 7 genera and 39 species, and these ticks were allocated to 113 sequence types. Of the 39 species of ticks, 36 species (92.3%) were distinguishable by phylogenetic analysis of mt-rrs. This result suggests that species identification of ticks based on mt-rrs is a viable alternative to morphological identification. In order to establish a DNA database for identification of ixodid and argasid ticks in Japan, we deposited all sequence data in GenBank (from AB819156 to AB819268).
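A deposited accession range such as AB819156 to AB819268 can be expanded and counted with a few lines of Python. This is a small sketch with a hypothetical helper name, `expand_accession_range`. It assumes both endpoints share the same letter prefix and digit width. The result can be compared with the 113 sequence types and the 36/39 figure reported above.

```python
import re
from typing import List


def expand_accession_range(first: str, last: str) -> List[str]:
    """Expand an inclusive range of accessions that share a letter prefix."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2:
        raise ValueError("expected accessions like 'AB819156'")
    if m1.group(1) != m2.group(1) or len(m1.group(2)) != len(m2.group(2)):
        raise ValueError("endpoints must share prefix and digit width")
    prefix, width = m1.group(1), len(m1.group(2))
    start, end = int(m1.group(2)), int(m2.group(2))
    if start > end:
        raise ValueError("range endpoints are reversed")
    # Zero-pad so that the digit width of the original accessions is preserved.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


accessions = expand_accession_range("AB819156", "AB819268")
print(len(accessions), accessions[0], accessions[-1])

# Share of species distinguishable by mt-rrs, from the counts in the abstract.
distinguishable_pct = round(100 * 36 / 39, 1)
print(distinguishable_pct)
```

The range contains 113 accessions, which matches the 113 sequence types stated in the abstract.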

Journal ArticleDOI
TL;DR: Circleator is a Perl application that generates circular figures of genome-associated data that leverages BioPerl to support standard annotation and sequence file formats and produces publication-quality SVG output.
Abstract: Summary: Circleator is a Perl application that generates circular figures of genome-associated data. It leverages BioPerl to support standard annotation and sequence file formats and produces publication-quality SVG output. It is designed to be both flexible and easy to use. It includes a library of circular track types and predefined configuration files for common use-cases, including (i) visualizing gene annotation and DNA sequence data from a GenBank flat file, (ii) displaying patterns of gene conservation in related microbial strains, (iii) showing Single Nucleotide Polymorphisms (SNPs) and indels relative to a reference genome and gene set and (iv) viewing RNA-Seq plots.

Journal ArticleDOI
TL;DR: The development of 240 novel EST-SSR loci for the important tree genus Eucalyptus L’Hérit would be a valuable addition of functional markers for genetics and breeding applications in a wide range of eucalypt species.
Abstract: Simple sequence repeat (SSR) markers derived from expressed sequence tag (EST) resources provide great potential for comparative mapping, direct gene tagging of quantitative trait loci and functional diversity studies. Here we report on the development of 240 novel EST-SSRs for the important tree genus Eucalyptus L’Herit. Of the 240 EST-SSR loci, 218 (90.8 %) were polymorphic among 12 individuals of E. grandis Hill ex Maiden, with the number of alleles per locus (Na), observed heterozygosity (Ho), expected heterozygosity (He) and polymorphic information content (PIC) averaging at 5.0, 0.403, 0.598 and 0.529, respectively. High rates of cross-species/subgenus amplification were observed. The EST-SSRs developed herein would be a valuable addition of functional markers for genetics and breeding applications in a wide range of eucalypt species. The primer sequences for the 240 EST-SSRs have been deposited in the Probe database of GenBank (IDs Pr016588534–773).
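For readers unfamiliar with the per-locus statistics quoted above, the sketch below computes expected heterozygosity (He = 1 − Σp_i²) and polymorphic information content (PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j²) from allele frequencies. The allele frequencies are invented for illustration, and the paper may have used a sample-size-corrected estimator of He, so this is only the textbook form.

```python
import math
from itertools import combinations
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """He = 1 - sum(p_i^2) for allele frequencies p_i at one locus."""
    if not math.isclose(sum(freqs), 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs: Sequence[float]) -> float:
    """PIC = He - sum over allele pairs i<j of 2 * p_i^2 * p_j^2."""
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - cross


# Hypothetical locus with three alleles.
freqs = [0.5, 0.3, 0.2]
he = expected_heterozygosity(freqs)
pic_value = pic(freqs)
print(round(he, 4), round(pic_value, 4))

# Share of polymorphic loci, from the counts in the abstract.
polymorphic_pct = round(100 * 218 / 240, 1)
print(polymorphic_pct)
```

PIC is never larger than He for the same locus, which is consistent with the reported averages (0.529 versus 0.598).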

Journal ArticleDOI
24 Feb 2014
TL;DR: The Plant Secretome and Subcellular Proteome KnowledgeBase (PlantSecKB) is developed for the plant research community to access and curate plant protein subcellular locations, with a focus on secreted proteins.
Abstract: Prediction and curation of protein subcellular locations is essential for protein functional annotation. We developed the Plant Secretome and Subcellular Proteome KnowledgeBase (PlantSecKB) for the plant research community to access and curate plant protein subcellular locations, with a focus on secreted proteins. The database is constructed with all the available plant protein data retrieved from the UniProtKB database and plant protein sequences predicted from EST data assembled by the PlantGDB project. The database contains information collected from three sources: (1) subcellular locations that were curated or computationally predicted in the UniProtKB; (2) subcellular locations and features predicted by eight computational tools; (3) secreted proteins that were curated from recent literature. The categories of subcellular locations include secretome, mitochondria, chloroplast, cytosol, cytoskeleton, endoplasmic reticulum, Golgi apparatus, lysosome, peroxisome, nucleus, vacuole, and plasma membrane. The data can be searched by using UniProt accession number or ID, GenBank GI or RefSeq accession number, gene name, and keywords. Species specific secretome and subcellular proteomes can be searched and downloaded into a FASTA file. BLAST is available to allow users to search the database based on protein sequences. Community curation for subcellular locations of plant proteins is also supported. A primary analysis revealed that monocots and dicots had a similar proportion of secretomes, and monocots had a significantly higher proportion of proteins distributed to mitochondria (both membrane and non-membrane) and chloroplast membrane, while dicots had significantly more proteins distributed to cytosol and nucleus. This database aims to facilitate plant protein research and is available at http://proteomics.ysu.edu/secretomes/plant.php.

Journal ArticleDOI
01 Oct 2014-Gene
TL;DR: The expression analysis of Dreb2 under normal and different levels of dehydration stress conditions indicated that the two active spliced isoforms are upregulated when the plant exposed to drought stress whereas the non-functional isoform is downregulated in severe drought.

Journal ArticleDOI
17 Jan 2014-Agronomy
TL;DR: A draft genome sequence for enset (Ensete ventricosum) is presented, suggesting a genome size similar to the 523-megabase genome of the closely related banana, and contains genes not present in banana, including reverse transcriptases and virus-like sequences.
Abstract: We present a draft genome sequence for enset (Ensete ventricosum) available via the Sequence Read Archive (accession number SRX202265) and GenBank (accession number AMZH01). Enset feeds 15 million people in Ethiopia, but is arguably the least studied African crop. Our sequence data suggest a genome size of approximately 547 megabases, similar to the 523-megabase genome of the closely related banana (Musa acuminata). At least 1.8% of the annotated M. acuminata genes are not conserved in E. ventricosum. Furthermore, enset contains genes not present in banana, including reverse transcriptases and virus-like sequences as well as a homolog of the RPP8-like resistance gene. We hope that availability of genome-wide sequence data will stimulate and accelerate research on this important but neglected crop.

Journal ArticleDOI
TL;DR: A proteogenomic analysis is presented for Venturia pirina, a fungus that causes scab disease on European pear, which identifies 1088 distinct V. pirina protein groups including 1085 detected for the first time.
Abstract: A proteogenomic analysis is presented for Venturia pirina, a fungus that causes scab disease on European pear (Pyrus communis). V. pirina is host-specific, and the infection is thought to be mediated by secreted effector proteins. Currently, only 36 V. pirina proteins are catalogued in GenBank, and the genome sequence is not publicly available. To identify putative effectors, V. pirina was grown in vitro on and in cellophane sheets mimicking its growth in infected leaves. Secreted extracts were analyzed by tandem mass spectrometry, and the data (ProteomeXchange identifier PXD000710) were queried against a protein database generated by combining in silico predicted transcripts with six-frame translations of a whole genome sequence of V. pirina (GenBank Accession JEMP00000000). We identified 1088 distinct V. pirina protein groups (FDR 1%) including 1085 detected for the first time. Thirty novel (not in silico predicted) proteins were found, of which 14 were identified as potential effectors based on characteristic features of fungal effector protein sequences. We also used evidence from semitryptic peptides at the protein N-terminus to corroborate in silico signal peptide predictions for 22 proteins, including several potential effectors. The analysis highlights the utility of proteogenomics in the study of secreted effectors.

Posted Content
TL;DR: This article conducted a phylogenetic survey of these complete or near-complete mtDNA sequences based on mtDNA haplogroup trees for eight common domestic animals (i.e. cattle, dog, goat, horse, pig, sheep, yak and chicken) and their close wild ancestors or relatives.
Abstract: More than 1000 complete or near-complete mitochondrial DNA (mtDNA) sequences have been deposited in GenBank for eight common domestic animals (i.e. cattle, dog, goat, horse, pig, sheep, yak and chicken) and their close wild ancestors or relatives. Nevertheless, few efforts have been made to evaluate the quality of the sequence data, which heavily impacts the original conclusions. Herein, we conducted a phylogenetic survey of these complete or near-complete mtDNA sequences based on mtDNA haplogroup trees for the eight animals. We show that errors due to artificial recombination, surplus of mutations, and phantom mutations do exist in 14.5% (194/1342) of mtDNA sequences and should be treated with great caution. We propose some caveats for mtDNA studies of domestic animals in the future.

Journal ArticleDOI
TL;DR: This study provides robust reference sequences for sequence-based identification of Bjerkandera, and further demonstrates the presence and dangers of incorrect sequences in GenBank.
Abstract: White-rot fungi of the genus Bjerkandera are cosmopolitan and have shown potential for industrial application and bioremediation. When distinguishing morphological characters are no longer present (e.g., cultures or dried specimen fragments), characterizing true sequences of Bjerkandera is crucial for accurate identification and application of the species. To build a framework for molecular identification of Bjerkandera, we carefully identified specimens of B. adusta and B. fumosa from Korea based on morphological characters, followed by sequencing the internal transcribed spacer region and 28S nuclear ribosomal large subunit. The phylogenetic analysis of Korean Bjerkandera specimens showed clear genetic differentiation between the two species. Using this phylogeny as a framework, we examined the identification accuracy of sequences available in GenBank. Analyses revealed that many Bjerkandera sequences in the database are either misidentified or unidentified. This study provides robust reference sequences for sequence-based identification of Bjerkandera, and further demonstrates the presence and dangers of incorrect sequences in GenBank.

Journal ArticleDOI
TL;DR: This study indicated that Ts-Pt was classified as a somatic protein in different T. spiralis developmental stages, and demonstrated for the first time that an expressed DNase II protein from T. spiralis had nuclease activity.
Abstract: Background: Deoxyribonuclease II (DNase II) is a well-known acidic endonuclease that catalyses the degradation of DNA into oligonucleotides. Only one or a few genes encoding DNase II have been observed in the genomes of many species. 125 DNase II-like protein family genes were predicted in the Trichinella spiralis (T. spiralis) genome; however, none have been confirmed. DNase II is a monomeric nuclease that contains two copies of a variant HKD motif in the N- and C-termini. Of these 125 genes, only plancitoxin-1 (1095 bp, GenBank accession no. XM_003370715.1) contains the HKD motif in its C-terminus domain. Methodology/Principal Findings: In this study, we cloned and characterised the plancitoxin-1 gene. However, the sequences of plancitoxin-1 cloned from T. spiralis were shorter than the predicted sequences in GenBank. Intriguingly, there were two HKD motifs in the N- and C-termini in the cloned sequences. Therefore, the gene with shorter sequences was named plancitoxin-1-like (Ts-Pt, 885 bp) and has been deposited in GenBank under accession number KF984291. The recombinant protein (rTs-Pt) was expressed in a prokaryotic expression system and purified by nickel affinity chromatography. Western blot analysis showed that rTs-Pt was recognised by serum from T. spiralis-infected mice; the anti-rTs-Pt serum recognised crude antigens but not ES antigens. The Ts-Pt gene was examined at all T. spiralis developmental stages by real-time quantitative PCR. Immunolocalisation analysis showed that Ts-Pt was distributed throughout newborn larvae (NBL), the tegument of adults (Ad) and muscle larvae (ML). As demonstrated by DNase zymography, the expressed proteins displayed cation-independent DNase activity. rTs-Pt had a narrow optimum pH range in slightly acidic conditions (pH 4 and pH 5), and its optimum temperature was 25°C, 30°C, and 37°C. Conclusions: This study indicated that Ts-Pt was classified as a somatic protein in different T. spiralis developmental stages, and demonstrated for the first time that an expressed DNase II protein from T. spiralis had nuclease activity.

Journal ArticleDOI
TL;DR: EST-SSRs have been identified that can be used to develop an EST-SSR-based detection tool for durum wheat in future studies and that will be a useful resource for molecular breeding, genetics, genomics, and environmental stress studies.
Abstract: The goal of this study was to identify and characterize expressed sequence tag (EST)-simple sequence repeat (SSR) markers from an EST library of durum wheat and to functionally analyze SSR-containing EST sequences for application in comparative genomics and breeding. Of the 19,141 sequences analyzed, 18,937 ESTs were selected. According to the MISA results, 313 EST-SSRs were obtained. The final EST-SSRs were compared to the GenBank non-redundant database using BLASTX and classified by putative function. Perfect EST-SSRs were the most frequent. The TTG/CTG imperfect EST-SSR had a putative gamma-gliadin function that may be relevant for durum wheat. Mononucleotide and trinucleotide repeats were the most frequent classes. The identified EST-SSRs could be categorized into 83 types. The motifs TTG among trinucleotides and TC among dinucleotides had the highest frequency. TTG is a new motif in durum wheat identified in this study. We identified new EST-SSRs with motifs longer than trinucleotides and detected motifs that have the potential to code for amino acids. Arginine was the most frequent amino acid. Enzymes had the highest frequency among predicted functions. The EST-SSRs identified in this study can be used to develop an EST-SSR-based detection tool for durum wheat in future studies and will be a useful resource for molecular breeding, genetics, genomics, and environmental stress studies. Motifs coding for amino acids could be used as a new source of functional markers and for biological study. In addition, the newly designed PCR primer pairs are new resources for identifying useful alleles in transcription factors, storage proteins, and enzymes and for incorporating them back into cultivated material.
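Perfect SSRs of the kind counted above can be located with a back-referencing regular expression. The sketch below is a simplified stand-in for a MISA-style search, not the tool itself or the authors' settings: the minimum repeat counts per motif length, the function name, and the test sequence are all assumptions chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

# ASSUMED thresholds: motif length -> minimum number of tandem repeats.
MIN_REPEATS: Dict[int, int] = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_perfect_ssrs(
    seq: str, min_repeats: Dict[int, int] = MIN_REPEATS
) -> List[Tuple[str, int, int]]:
    """Return (motif, repeat_count, start) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are just a shorter unit repeated (e.g. "AA", "TCTC").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((motif, len(match.group(0)) // unit_len, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


# Made-up sequence containing a (TC)7 and a (TTG)5 repeat.
example = "GGA" + "TC" * 7 + "AAGC" + "TTG" * 5 + "CA"
ssrs = find_perfect_ssrs(example)
print(ssrs)
```

Because the regex engine reports the leftmost match, the motif it names depends on where the repeat starts; a run preceded by a `C` could be reported as `CT` rather than `TC`, so motifs are usually normalized before counting types.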

Journal ArticleDOI
TL;DR: Several individuals of the Caribbean Zamia clade and other cycad genera were used to identify single‐copy nuclear genes for phylogeographic and phylogenetic studies in Cycadales and 29 loci were successfully amplified as a single band of which 20 were likely single‐copy loci.
Abstract: Several individuals of the Caribbean Zamia clade and other cycad genera were used to identify single-copy nuclear genes for phylogeographic and phylogenetic studies in Cycadales. Two strategies were employed to select target loci: (i) a tblastX search of Arabidopsis conserved ortholog sequence (COS) set and (ii) a tblastX search of Arabidopsis-Populus-Vitis-Oryza Shared Single-Copy genes (APVO SSC) against the EST Zamia databases in GenBank. From the first strategy, 30 loci were selected, and from the second, 16 loci. In both cases, the matching GenBank accessions of Zamia were used as a query for retrieving highly similar sequences from Cycas, Picea, Pinus species or Ginkgo biloba. After retrieving and aligning all the sequences in each locus, intron predictions were completed to assist in primer design. PCR was carried out in three rounds to detect paralogous loci. A total of 29 loci were successfully amplified as a single band of which 20 were likely single-copy loci. These loci showed different diversity and divergence levels. A preliminary screening allowed us to select 8 promising loci (40S, ATG2, BG, GroES, GTP, LiSH, PEX4 and TR) for the Zamia pumila complex and 4 loci (COS26, GroES, GTP and HTS) for all other cycad genera.

Journal ArticleDOI
TL;DR: This is the first report on the use of EST analysis to study the gene expression profiles in C. asiatica hairy roots after methyl jasmonate (MJ) treatment as an elicitor.
Abstract: Centella asiatica (L.) Urban, which is a perennial plant in the family Umbelliferae, is commercially utilized as a wound-healing agent, due to its potent anti-inflammatory effects. Despite the medicinal importance of C. asiatica, little genomic or transcriptomic data are available from the public databases. To identify the genes involved in biosynthesis of these materials, an expressed sequence tag (EST) analysis was performed from C. asiatica hairy roots after methyl jasmonate (MJ) treatment as an elicitor. Sequencing of 4,896 cDNA clones generated 4,381 5′-end high-quality ESTs (GenBank Accession Number JK513929–JK518371, average length 625.8 bp), of which 2,837 (376 contigs and 2,461 singletons) were revealed to be unique by sequence comparison, and 978 (34.4 % of total clusters) had a putative ATG start codon. A total of 2,420 unique sequences were annotated by a BLAST similarity search. Asiatic acid and madecassic acid, as precursors of asiaticoside and madecassoside, respectively, were biosynthesized from α-amyrin by cytochrome P450 hydroxylase and carboxylase (P450). Two centellosides were catalyzed from asiatic acid and madecassic acid by UDP-glucosyltransferases (UGTs). We identified 24 P450 and 13 UGT candidates from the database. Finally, 3 of the P450s and 1 UGT were selected as candidates most likely to be involved in centelloside biosynthesis in a MJ inducibility experiment based on a real-time reverse transcription-polymerase chain reaction assay. This is the first report on the use of EST analysis to study the gene expression profiles in C. asiatica hairy roots elicited by MJ.

Journal ArticleDOI
23 Jan 2014
TL;DR: The record of a single genotype determined a strain which could be incriminated for camel and human infectivity and responsible for its persistence in the endemic areas could guide the application of efficient control strategies of hydatidosis in Egypt.
Abstract: The objectives of the present study were to investigate strain identification of Echinococcus granulosus infecting camels and humans in Egypt. Partial sequences were generated after gel purification of nested-PCR-amplified products of the mitochondrial NADH 1 gene of the Echinococcus granulosus complex. Sequences were further examined by sequence analysis and subsequent phylogeny to compare them with those from known strains of E. granulosus circulating globally and retrieved from GenBank. All isolates are homologous to the camel strain, E. canadensis (G6) genotype. Nucleotide mutations generate polymorphism at nucleotide position 275, where a thymine replaced a cytosine, and at positions 385 and 386, where two cytosines substituted a guanine and a thymine, respectively. KF815488 Egypt showed typical identity (99.5%) with JN637176 Sudan, HM853659 Iran, AF386533 France and AJ237637 Poland, with 0.5% divergence. Phylogenetic analysis showed a robust tree clustering all isolates with sequences belonging to the camel genotype (G6) variant with strong bootstrap values at relevant nodes, and the evolutionary distance between groups is very short. There are two mutations in the amino acid sequence: at position 92, where an alanine is changed to a valine, and at position 129, where a valine is changed to a proline. Our record of a single genotype determined a strain which could be incriminated for camel and human infectivity and responsible for its persistence in the endemic areas. Such epidemiological data could guide the application of efficient control strategies of hydatidosis in Egypt.

Journal ArticleDOI
15 Sep 2014-Gene
TL;DR: The phylogenomic reconstruction of this newly generated data with 21 Cypriniformes GenBank accession ID concurs with the recognized status of T. tambroides within the subfamily Cypr in agreement with previous hypotheses based on morphological and partial mitochondrial analyses.

Journal ArticleDOI
21 Oct 2014-PLOS ONE
TL;DR: The data demonstrate that He185/333 bears the same substantial characteristics as their S. purpuratus homologues, and identifies several unique characteristics of He185/333 (such as novel element patterns, sequence repeats, distribution of positively-selected codons and introns), suggesting species-specific adaptations.
Abstract: This study characterizes the highly variable He185/333 genes, transcripts and proteins in coelomocytes of the sea urchin, Heliocidaris erythrogramma. Originally discovered in the purple sea urchin, Strongylocentrotus purpuratus, the products of this gene family participate in the anti-pathogen defenses of the host animals. Full-length He185/333 genes and transcripts are identified. Complete open reading frames of He185/333 homologues are analyzed as to their element structure, single nucleotide polymorphisms, indels and sequence repeats and are subjected to diversification analyses. The sequence elements that compose He185/333 are different to those identified for Sp185/333. Differences between Sp185/333 and He185/333 genes are also evident in the complexity of the sequences of the introns. He185/333 proteins show a diverse range of molecular weights on Western blots. The observed sizes and pIs of the proteins differ from predicted values, suggesting post-translational modifications and oligomerization. Immunofluorescence microscopy shows that He185/333 proteins are mainly located on the surface of coelomocyte subpopulations. Our data demonstrate that He185/333 bears the same substantial characteristics as their S. purpuratus homologues. However, we also identify several unique characteristics of He185/333 (such as novel element patterns, sequence repeats, distribution of positively-selected codons and introns), suggesting species-specific adaptations. All sequences in this publication have been submitted to GenBank (accession numbers JQ780171-JQ780321) and are listed in Table S1.