
Showing papers in "BMC Genomics in 2012"


Journal ArticleDOI
TL;DR: All three fast turnaround sequencers evaluated here were able to generate usable sequence; however, there are key differences between the quality of that data and the applications it will support.
Abstract: Next generation sequencing (NGS) technology has revolutionized genomic and genetic research. The pace of change in this area is rapid with three major new sequencing platforms having been released in 2011: Ion Torrent’s PGM, Pacific Biosciences’ RS and the Illumina MiSeq. Here we compare the results obtained with those platforms to the performance of the Illumina HiSeq, the current market leader. In order to compare these platforms, and get sufficient coverage depth to allow meaningful analysis, we have sequenced a set of 4 microbial genomes with mean GC content ranging from 19.3 to 67.7%. Together, these represent a comprehensive range of genome content. Here we report our analysis of that sequence data in terms of coverage distribution, bias, GC distribution, variant detection and accuracy. Sequence generated by Ion Torrent, MiSeq and Pacific Biosciences technologies displays near perfect coverage behaviour on GC-rich, neutral and moderately AT-rich genomes, but a profound bias was observed upon sequencing the extremely AT-rich genome of Plasmodium falciparum on the PGM, resulting in no coverage for approximately 30% of the genome. We analysed the ability to call variants from each platform and found that we could call slightly more variants from Ion Torrent data compared to MiSeq data, but at the expense of a higher false positive rate. Variant calling from Pacific Biosciences data was possible but higher coverage depth was required. Context specific errors were observed in both PGM and MiSeq data, but not in that from the Pacific Biosciences platform. All three fast turnaround sequencers evaluated here were able to generate usable sequence. However there are key differences between the quality of that data and the applications it will support.

1,967 citations
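
The abstract above characterizes its four test genomes by mean GC content (19.3% to 67.7%). As a minimal sketch of how that quantity is computed, assuming the sequence is available as a plain string, the hypothetical helper `gc_content` below counts G and C among the unambiguous bases. It illustrates the definition only and is not the authors' pipeline.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases.

    Ambiguity codes such as N are ignored (an assumption of this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / total


if __name__ == "__main__":
    # Toy sequence, not real genome data.
    print(f"{gc_content('ATGCGCNN'):.1%}")
```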


Journal ArticleDOI
TL;DR: A robust, optimized Next-Generation Sequencing library amplification method suitable for extremely AT-rich genomes is developed; it will greatly benefit the sequencing of clinical samples, which often require amplification because of the low mass of DNA starting material.
Abstract: Massively parallel sequencing technology is revolutionizing approaches to genomic and genetic research. Since its advent, the scale and efficiency of Next-Generation Sequencing (NGS) has rapidly improved. In spite of this success, sequencing genomes or genomic regions with extremely biased base composition is still a great challenge to the currently available NGS platforms. The genomes of some important pathogenic organisms like Plasmodium falciparum (high AT content) and Mycobacterium tuberculosis (high GC content) display extremes of base composition. The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses. Alternative library-preparation approaches that omit PCR amplification require large quantities of starting material and hence are not suitable for small amounts of DNA/RNA such as those from clinical isolates. We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content sequences.

550 citations


Journal ArticleDOI
TL;DR: CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results, and will become an indispensable tool for researchers studying chloroplast genomes.
Abstract: The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially, and powerful computational tools for annotating the genome sequences are urgently needed. We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tool. The edited annotations can then be uploaded to CPGAVAS for repeated update and re-analysis. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while having several superior functionalities. CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from http://www.herbalgenomics.org/cpgavas .

501 citations


Journal ArticleDOI
TL;DR: Overexpressed miR-146a was highly enriched both in transfected cells and their EVs, and the cellular:EV ratios of endogenous miRNAs were not grossly altered, suggesting a common mechanism for selective miRNA export may exist.
Abstract: MicroRNAs (miRNAs) are a class of small RNA molecules that regulate expression of specific mRNA targets. They can be released from cells, often encapsulated within extracellular vesicles (EVs), and therefore have the potential to mediate intercellular communication. It has been suggested that certain miRNAs may be selectively exported, although the mechanism has yet to be identified. Manipulation of the miRNA content of EVs will be important for future therapeutic applications. We therefore wished to assess which endogenous miRNAs are enriched in EVs and how effectively an overexpressed miRNA would be exported. Small RNA libraries from HEK293T cells and vesicles before or after transfection with a vector for miR-146a overexpression were analysed by deep sequencing. A subset of miRNAs was found to be enriched in EVs; pathway analysis of their predicted target genes suggests a potential role in regulation of endocytosis. RT-qPCR in additional cell types and analysis of publicly available data revealed that many of these miRNAs tend to be widely preferentially exported. Whilst overexpressed miR-146a was highly enriched both in transfected cells and their EVs, the cellular:EV ratios of endogenous miRNAs were not grossly altered. MiR-451 was consistently the most highly exported miRNA in many different cell types. Intriguingly, Argonaute2 (Ago2) is required for miR-451 maturation and knock out of Ago2 has been shown to decrease expression of other preferentially exported miRNAs (eg miR-150 and miR-142-3p). The global expression data provided by deep sequencing confirms that specific miRNAs are enriched in EVs released by HEK293T cells. Observation of similar patterns in a range of cell types suggests that a common mechanism for selective miRNA export may exist.

442 citations


Journal ArticleDOI
TL;DR: A comprehensive genome-wide analysis of chromosomal distribution, tandem repeats and phylogenetic relationship of MYB family genes in rice and Arabidopsis suggested their evolution via duplication.
Abstract: The MYB gene family comprises one of the richest groups of transcription factors in plants. Plant MYB proteins are characterized by a highly conserved MYB DNA-binding domain. MYB proteins are classified into four major groups, namely 1R-MYB, 2R-MYB, 3R-MYB and 4R-MYB, based on the number and position of MYB repeats. MYB transcription factors are involved in plant development, secondary metabolism, hormone signal transduction, disease resistance and abiotic stress tolerance. A comparative analysis of MYB family genes in rice and Arabidopsis will help reveal the evolution and function of MYB genes in plants. A genome-wide analysis identified at least 155 and 197 MYB genes in rice and Arabidopsis, respectively. Gene structure analysis revealed that MYB family genes possess relatively more introns in the middle as compared with C- and N-terminal regions of the predicted genes. Intronless MYB-genes are highly conserved both in rice and Arabidopsis. MYB genes encoding R2R3 repeat MYB proteins retained conserved gene structure with three exons and two introns, whereas genes encoding R1R2R3 repeat containing proteins consist of six exons and five introns. The splicing pattern is similar among R1R2R3 MYB genes in Arabidopsis. In contrast, variation in splicing pattern was observed among R1R2R3 MYB members of rice. Consensus motif analysis of the 1 kb upstream region (5′ to translation initiation codon) of MYB gene ORFs led to the identification of conserved and over-represented cis-motifs in both rice and Arabidopsis. Real-time quantitative RT-PCR analysis showed that several members of MYBs are up-regulated by various abiotic stresses both in rice and Arabidopsis. A comprehensive genome-wide analysis of chromosomal distribution, tandem repeats and phylogenetic relationship of MYB family genes in rice and Arabidopsis suggested their evolution via duplication. Genome-wide comparative analysis of MYB genes and their expression analysis identified several MYBs with potential roles in development and stress response of plants.

405 citations


Journal ArticleDOI
TL;DR: It is suggested that chromatin modifications, such as H3K9ac and H3K14ac, are part of the active promoter state, are present over bivalent promoters and active enhancers and that the extent of H3K9 and H3K14 acetylation could be driven by cis regulatory elements such as CpG content at promoters.
Abstract: Background: Transcription regulation in pluripotent embryonic stem (ES) cells is a complex process that involves a multitude of regulatory layers, one of which is post-translational modification of histones. Acetylation of specific lysine residues of histones plays a key role in regulating gene expression. Results: Here we have investigated the genome-wide occurrence of two histone marks, acetylation of histone H3K9 and K14 (H3K9ac and H3K14ac), in mouse embryonic stem (mES) cells. Genome-wide H3K9ac and H3K14ac show very high correlation between each other as well as with other histone marks (such as H3K4me3), suggesting a coordinated regulation of active histone marks. Moreover, the levels of H3K9ac and H3K14ac directly correlate with the CpG content of the promoters, attesting to the importance of sequences underlying the specifically modified nucleosomes. Our data provide evidence that H3K9ac and H3K14ac are also present over the previously described bivalent promoters, along with H3K4me3 and H3K27me3. Furthermore, like H3K27ac, H3K9ac and H3K14ac can also differentiate active enhancers from inactive ones. Although H3K9ac and H3K14ac, hallmarks of gene activation, exhibit remarkable correlation over active and bivalent promoters as well as distal regulatory elements, a subset of inactive promoters is selectively enriched for H3K14ac. Conclusions: Our study suggests that chromatin modifications, such as H3K9ac and H3K14ac, are part of the active promoter state, are present over bivalent promoters and active enhancers and that the extent of H3K9 and H3K14 acetylation could be driven by cis regulatory elements such as CpG content at promoters. Our study also suggests that a subset of inactive promoters is selectively and specifically enriched for H3K14ac. This observation suggests that histone acetyl transferases (HATs) prime inactive genes by H3K14ac for stimuli dependent activation. 
In conclusion our study demonstrates a wider role for H3K9ac and H3K14ac in gene regulation than originally thought.

399 citations


Journal ArticleDOI
TL;DR: A new criterion for computational prediction of nitrogen fixation is proposed: the presence of a minimum set of six genes coding for structural and biosynthetic components, namely NifHDK and NifENB.
Abstract: The metabolic capacity for nitrogen fixation is known to be present in several prokaryotic species scattered across taxonomic groups. Experimental detection of nitrogen fixation in microbes requires species-specific conditions, making it difficult to obtain a comprehensive census of this trait. The recent and rapid increase in the availability of microbial genome sequences affords novel opportunities to re-examine the occurrence and distribution of nitrogen fixation genes. The current practice for computational prediction of nitrogen fixation is to use the presence of the nifH and/or nifD genes. Based on a careful comparison of the repertoire of nitrogen fixation genes in known diazotroph species we propose a new criterion for computational prediction of nitrogen fixation: the presence of a minimum set of six genes coding for structural and biosynthetic components, namely NifHDK and NifENB. Using this criterion, we conducted a comprehensive search in fully sequenced genomes and identified 149 diazotrophic species, including 82 known diazotrophs and 67 species not known to fix nitrogen. The taxonomic distribution of nitrogen fixation in Archaea was limited to the Euryarchaeota phylum; within the Bacteria domain we predict that nitrogen fixation occurs in 13 different phyla. Of these, seven phyla had not hitherto been known to contain species capable of nitrogen fixation. Our analyses also identified protein sequences that are similar to nitrogenase in organisms that do not meet the minimum-gene-set criteria. The existence of nitrogenase-like proteins lacking conserved co-factor ligands in both diazotrophs and non-diazotrophs suggests their potential for performing other, as yet unidentified, metabolic functions. Our predictions expand the known phylogenetic diversity of nitrogen fixation, and suggest that this trait may be much more common in nature than it is currently thought. 
The diverse phylogenetic distribution of nitrogenase-like proteins indicates potential new roles for anciently duplicated and divergent members of this group of enzymes.

373 citations
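
The six-gene criterion proposed in the abstract above (NifHDK plus NifENB) amounts to a set-membership test. The sketch below illustrates that logic only: it assumes a genome's annotation has already been reduced to a list of gene names, and the names `REQUIRED_NIF_GENES`, `missing_nif_genes` and `predicted_diazotroph` are hypothetical. The study itself compared gene repertoires across sequenced genomes, so name matching is a simplification rather than the authors' method.

```python
from typing import Iterable, Set

# Minimum gene set named in the abstract: structural (NifHDK) and
# biosynthetic (NifENB) components.
REQUIRED_NIF_GENES = frozenset({"nifH", "nifD", "nifK", "nifE", "nifN", "nifB"})


def missing_nif_genes(annotated_genes: Iterable[str]) -> Set[str]:
    """Return the required nif genes absent from an annotation.

    Matching is case-insensitive on gene names (an assumption of this sketch).
    """
    present = {gene.lower() for gene in annotated_genes}
    return {gene for gene in REQUIRED_NIF_GENES if gene.lower() not in present}


def predicted_diazotroph(annotated_genes: Iterable[str]) -> bool:
    """True only if all six required genes are annotated."""
    return not missing_nif_genes(annotated_genes)


if __name__ == "__main__":
    # Toy annotation: nifH and nifD alone satisfy the older practice
    # but not the six-gene criterion.
    genome = ["nifH", "nifD", "recA"]
    print(predicted_diazotroph(genome), sorted(missing_nif_genes(genome)))
```

A genome carrying only nifH and/or nifD would pass the older practice described in the abstract but fail this stricter check.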


Journal ArticleDOI
TL;DR: By establishing the phylogenetic and positional relationship of potato NB-LRRs, the analysis offers significant insight into the evolution of potato R genes and provides a blueprint for future efforts to identify and more rapidly clone functional NB-LRR genes from Solanum species.
Abstract: The potato genome sequence derived from the Solanum tuberosum Group Phureja clone DM1-3 516 R44 provides unparalleled insight into the genome composition and organisation of this important crop. A key class of genes that comprises the vast majority of plant resistance (R) genes contains a nucleotide-binding and leucine-rich repeat domain, and is collectively known as NB-LRRs. As part of an effort to accelerate the process of functional R gene isolation, we performed an amino acid motif based search of the annotated potato genome and identified 438 NB-LRR type genes among the ~39,000 potato gene models. Of the predicted genes, 77 contain an N-terminal toll/interleukin 1 receptor (TIR)-like domain, and 107 of the remaining 361 non-TIR genes contain an N-terminal coiled-coil (CC) domain. Physical map positions were established for 370 predicted NB-LRR genes across all 12 potato chromosomes. The majority of NB-LRRs are physically organised within 63 identified clusters, of which 50 are homogeneous in that they contain NB-LRRs derived from a recent common ancestor. By establishing the phylogenetic and positional relationship of potato NB-LRRs, our analysis offers significant insight into the evolution of potato R genes. Furthermore, the data provide a blueprint for future efforts to identify and more rapidly clone functional NB-LRR genes from Solanum species.

276 citations


Journal ArticleDOI
TL;DR: This work has provided the most complete characterization of the genes expressed in an active snake venom gland to date, producing insights into snakebite pathology and guidance for snakebite treatment for the largest rattlesnake species and arguably the most dangerous snake native to the United States of America, C. adamanteus.
Abstract: Snake venoms have significant impacts on human populations through the morbidity and mortality associated with snakebites and as sources of drugs, drug leads, and physiological research tools. Genes expressed by venom-gland tissue, including those encoding toxic proteins, have therefore been sequenced but only with relatively sparse coverage resulting from the low-throughput sequencing approaches available. High-throughput approaches based on 454 pyrosequencing have recently been applied to the study of snake venoms to give the most complete characterizations to date of the genes expressed in active venom glands, but such approaches are costly and still provide a far-from-complete characterization of the genes expressed during venom production. We describe the de novo assembly and analysis of the venom-gland transcriptome of an eastern diamondback rattlesnake (Crotalus adamanteus) based on 95,643,958 pairs of quality-filtered, 100-base-pair Illumina reads. We identified 123 unique, full-length toxin-coding sequences, which cluster into 78 groups with less than 1% nucleotide divergence, and 2,879 unique, full-length nontoxin coding sequences. The toxin sequences accounted for 35.4% of the total reads, and the nontoxin sequences for an additional 27.5%. The most highly expressed toxin was a small myotoxin related to crotamine, which accounted for 5.9% of the total reads. Snake-venom metalloproteinases accounted for the highest percentage of reads mapping to a toxin class (24.4%), followed by C-type lectins (22.2%) and serine proteinases (20.0%). The most diverse toxin classes were the C-type lectins (21 clusters), the snake-venom metalloproteinases (16 clusters), and the serine proteinases (14 clusters). The high-abundance nontoxin transcripts were predominantly those involved in protein folding and translation, consistent with the protein-secretory function of the tissue. 
We have provided the most complete characterization of the genes expressed in an active snake venom gland to date, producing insights into snakebite pathology and guidance for snakebite treatment for the largest rattlesnake species and arguably the most dangerous snake native to the United States of America, C. adamanteus. We have more than doubled the number of sequenced toxins for this species and created extensive genomic resources for snakes based entirely on de novo assembly of Illumina sequence data.

269 citations


Journal ArticleDOI
TL;DR: It is found that primary DNA sequence divergence is the major determinant of methylational differences at the whole genome level, but DNA methylational difference alone can only account for limited gene expression variation between the cultivated and wild rice.
Abstract: DNA methylation plays important biological roles in plants and animals. To examine the rice genomic methylation landscape and assess its functional significance, we generated single-base resolution DNA methylome maps for Asian cultivated rice Oryza sativa ssp. japonica, indica and their wild relatives, Oryza rufipogon and Oryza nivara. The overall methylation level of rice genomes is four times higher than that of Arabidopsis. Consistent with the results reported for Arabidopsis, methylation in promoters represses gene expression while gene-body methylation generally appears to be positively associated with gene expression. Interestingly, we discovered that methylation in gene transcriptional termination regions (TTRs) can significantly repress gene expression, and the effect is even stronger than that of promoter methylation. Through integrated analysis of genomic, DNA methylomic and transcriptomic differences between cultivated and wild rice, we found that primary DNA sequence divergence is the major determinant of methylational differences at the whole genome level, but DNA methylational difference alone can only account for limited gene expression variation between the cultivated and wild rice. Furthermore, we identified a number of genes with significant difference in methylation level between the wild and cultivated rice. The single-base resolution methylomes of rice obtained in this study have not only broadened our understanding of the mechanism and function of DNA methylation in plant genomes, but also provided valuable data for future studies of rice epigenetics and the epigenetic differentiation between wild and cultivated rice.

267 citations


Journal ArticleDOI
TL;DR: This study demonstrates the potential of a transcriptome-enabled, multiplexed, exon capture method to create thousands of informative markers for population genomic and phylogenetic studies in non-model species across the tree of life.
Abstract: To date, exon capture has largely been restricted to species with fully sequenced genomes, which has precluded its application to lineages that lack high quality genomic resources. We developed a novel strategy for designing array-based exon capture in chipmunks (Tamias) based on de novo transcriptome assemblies. We evaluated the performance of our approach across specimens from four chipmunk species. We selectively targeted 11,975 exons (~4 Mb) on custom capture arrays, and enriched over 99% of the targets in all libraries. The percentage of aligned reads was highly consistent (24.4-29.1%) across all specimens, including in multiplexing up to 20 barcoded individuals on a single array. Base coverage among specimens and within targets in each species library was uniform, and the performance of targets among independent exon captures was highly reproducible. There was no decrease in coverage among chipmunk species, which showed up to 1.5% sequence divergence in coding regions. We did observe a decline in capture performance of a subset of targets designed from a much more divergent ground squirrel genome (30 My), however, over 90% of the targets were also recovered. Final assemblies yielded over ten thousand orthologous loci (~3.6 Mb) with thousands of fixed and polymorphic SNPs among species identified. Our study demonstrates the potential of a transcriptome-enabled, multiplexed, exon capture method to create thousands of informative markers for population genomic and phylogenetic studies in non-model species across the tree of life.

Journal ArticleDOI
TL;DR: The findings identify possible functions of these SNP loci and provide a basis for subsequent functional research; MirSNP could identify putative miRNA-related SNPs from GWAS and eQTL studies.
Abstract: Numerous single nucleotide polymorphisms (SNPs) associated with complex diseases have been identified by genome-wide association studies (GWAS) and expression quantitative trait loci (eQTLs) studies. However, few of these SNPs have explicit biological functions. Recent studies indicated that the SNPs within the 3’UTR regions of susceptibility genes could affect complex traits/diseases by affecting the function of miRNAs. These 3’UTR SNPs are functional candidates and therefore of interest to GWAS and eQTL researchers. We developed a publicly available online database, MirSNP ( http://cmbi.bjmu.edu.cn/mirsnp ), which is a collection of human SNPs in predicted miRNA-mRNA binding sites. We identified 414,510 SNPs that might affect miRNA-mRNA binding. Annotations were added to these SNPs to predict whether a SNP within the target site would decrease/break or enhance/create an miRNA-mRNA binding site. By applying the MirSNP database to three brain eQTL data sets, we identified four unreported SNPs (rs3087822, rs13042, rs1058381, and rs1058398), which might affect miRNA binding and thus affect the expression of their host genes in the brain. We also applied the MirSNP database to our GWAS for schizophrenia: seven predicted miRNA-related SNPs (p < 0.0001) were found in the schizophrenia GWAS. Our findings identified the possible functions of these SNP loci, and provide the basis for subsequent functional research. MirSNP could identify the putative miRNA-related SNPs from GWAS and eQTL studies and provide direction for subsequent functional research.

Journal ArticleDOI
TL;DR: Six main performance evaluation measures are introduced and together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison.
Abstract: Prediction methods are increasingly used in biosciences to forecast diverse features and characteristics. Binary two-state classifiers are the most common applications. They are usually based on machine learning approaches. For the end user it is often problematic to evaluate the true performance and applicability of computational tools as some knowledge about computer science and statistics would be needed. Instructions are given on how to interpret and compare method evaluation results. Systematic analysis of method performance requires established benchmark datasets containing cases with known outcomes, together with suitable evaluation measures. The criteria for benchmark datasets are discussed along with their implementation in VariBench, a benchmark database for variations. There is no single measure that alone could describe all the aspects of method performance. Predictions of genetic variation effects on DNA, RNA and protein level are important as information about variants can be produced much faster than their disease relevance can be experimentally verified. Numerous prediction tools have therefore been developed; however, systematic analyses of their performance and comparisons have only just started to emerge. The end users of prediction tools should be able to understand how evaluation is done and how to interpret the results. Six main performance evaluation measures are introduced. These include sensitivity, specificity, positive predictive value, negative predictive value, accuracy and Matthews correlation coefficient. Together with receiver operating characteristics (ROC) analysis they provide a good picture about the performance of methods and allow their objective and quantitative comparison. A checklist of items to look at is provided. Comparisons of methods for missense variant tolerance, protein stability changes due to amino acid substitutions, and effects of variations on mRNA splicing are presented.
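
The six measures named in the abstract above all derive from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the toy counts are illustrative assumptions, not values or code from the paper.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute six standard measures for a binary (two-state) classifier."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true positive rate
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "ppv": _ratio(tp, tp + fp),          # positive predictive value
        "npv": _ratio(tn, tn + fn),          # negative predictive value
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Toy counts chosen only for illustration.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Reporting all six together reflects the abstract's point that no single measure describes every aspect of performance: accuracy alone, for instance, can look high on an imbalanced benchmark while sensitivity is poor.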

Journal ArticleDOI
TL;DR: By visually presenting sequence conservation information along with functional classifications and sequence composition characteristics, CCT can be a useful tool for identifying rapidly evolving or novel sequences, horizontally transferred sequences, or unusual functional properties in newly sequenced genomes.
Abstract: Continued sequencing efforts coupled with advances in sequencing technology will lead to the completion of a vast number of small genomes. Whole-genome comparisons represent an important part of the analysis of any new genome sequence, as they can provide a better understanding of the biology and evolution of the source organism. Visualization of the results is important, as it allows information from a variety of sources to be integrated and interpreted. However, existing graphical comparison tools lack features needed for efficiently comparing a new genome to hundreds or thousands of existing sequences. Moreover, existing tools are limited in terms of the types of comparisons that can be performed, the extent to which the output can be customized, and the ease with which the entire process can be automated. The CGView Comparison Tool (CCT) is a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG (Clusters of Orthologous Groups of proteins) category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation. 
CCT can be used to visually compare a reference sequence to thousands of existing genomes or sequence collections (next-generation sequencing reads for example) on a standard desktop computer. It provides analysis and visualization functionality not available in any existing circular genome visualization tool. By visually presenting sequence conservation information along with functional classifications and sequence composition characteristics, CCT can be a useful tool for identifying rapidly evolving or novel sequences, horizontally transferred sequences, or unusual functional properties in newly sequenced genomes. CCT is freely available for download at http://stothard.afns.ualberta.ca/downloads/CCT/ .

Journal ArticleDOI
TL;DR: This work defines a custom data processing pipeline for Pacific Biosciences data for human data analysis, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs.
Abstract: Pacific Biosciences technology provides a fundamentally new data type that provides the potential to overcome some limitations of current next generation sequencing platforms by providing significantly longer reads, single molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we here evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects. We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). In this work, we define a custom data processing pipeline for Pacific Biosciences data for human data analysis. Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, but can be extended to other organisms with a reference genome.

Journal ArticleDOI
TL;DR: The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
Abstract: Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number. A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented. The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
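
To illustrate what penalized least squares segmentation means in the abstract above, here is a minimal single-track sketch: a piecewise-constant fit that minimizes the residual sum of squares plus a penalty `gamma` per segment, solved by straightforward dynamic programming. This is a generic textbook formulation written in Python for illustration. It is not the copynumber package's algorithm (which is implemented in R and is described as far more computationally efficient), and the names `segment` and `gamma` are assumptions.

```python
from typing import List, Sequence


def segment(values: Sequence[float], gamma: float) -> List[int]:
    """Return breakpoint indices of a penalized least squares fit.

    A breakpoint b means a new segment starts at values[b].
    Cost = sum of squared deviations from segment means + gamma per segment.
    """
    n = len(values)
    if n == 0:
        return []

    # Prefix sums allow O(1) evaluation of any segment's squared error.
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] + v
        prefix_sq[i + 1] = prefix_sq[i] + v * v

    def sse(i: int, j: int) -> float:
        """Squared error of values[i:j] around its own mean."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    # best[j] = minimal penalized cost of segmenting values[:j].
    best = [0.0] + [float("inf")] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + gamma
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk back through the stored segment starts.
    breakpoints: List[int] = []
    j = n
    while j > 0:
        j = prev[j]
        if j > 0:
            breakpoints.append(j)
    return breakpoints[::-1]


if __name__ == "__main__":
    # Toy track with one level shift; not real copy number data.
    print(segment([0, 0, 0, 0, 5, 5, 5, 5], gamma=1.0))
```

A larger `gamma` yields fewer segments. This quadratic-time version only shows the objective being optimized and would be too slow for the very high density scans the abstract mentions.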

Journal ArticleDOI
TL;DR: This study supports the view that transcriptome analysis based on Illumina paired-end sequencing is a powerful tool for transcriptome characterization and molecular marker development in non-model species, especially those with large and complex genomes.
Abstract: In rubber tree, bark is one of the important agricultural and biological organs. However, the molecular mechanism involved in bark formation and development in rubber tree remains largely unknown, which is at least partially due to a lack of bark transcriptomic and genomic information. Therefore, it is necessary to carry out high-throughput transcriptome sequencing of rubber tree bark to generate a large number of transcript sequences for functional characterization and molecular marker development. In this study, more than 30 million sequencing reads were generated using Illumina paired-end sequencing technology. In total, 22,756 unigenes with an average length of 485 bp were obtained with de novo assembly. The similarity search indicated that 16,520 and 12,558 unigenes showed significant similarities to known proteins from the NCBI non-redundant and Swissprot protein databases, respectively. Among these annotated unigenes, 6,867 and 5,559 unigenes were assigned to Gene Ontology (GO) and Clusters of Orthologous Group (COG), respectively. When the 22,756 unigenes were searched against the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database, 12,097 unigenes were assigned to 5 main categories including 123 KEGG pathways. Among the main KEGG categories, metabolism was the biggest category (9,043, 74.75%), suggesting active metabolic processes in rubber tree bark. In addition, a total of 39,257 EST-SSRs were identified from the 22,756 unigenes, and the characteristics of the EST-SSRs were further analyzed in rubber tree. 110 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Among 13 Hevea germplasms, the PCR success rate and polymorphism rate of the 110 markers were 96.36% and 55.45%, respectively. By assembling and analyzing de novo transcriptome sequencing data, we report the comprehensive functional characterization of rubber tree bark. 
This research generated a substantial fraction of rubber tree transcriptome sequences, which are very useful resources for gene annotation and discovery, molecular marker development, genome assembly and annotation, and microarray development in rubber tree. The EST-SSR markers identified and developed in this study will facilitate marker-assisted selection breeding in rubber tree. Moreover, this study also supports the view that transcriptome analysis based on Illumina paired-end sequencing is a powerful tool for transcriptome characterization and molecular marker development in non-model species, especially those with large and complex genomes.
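To make the EST-SSR identification step concrete, here is a minimal sketch of a simple sequence repeat (SSR) scan over a unigene sequence. The abstract does not say which tool or thresholds the authors used, so the function name `find_ssrs`, the default of five repeats, and the maximum unit length of six bases are all assumptions for illustration.

```python
import re
from typing import List, Tuple


def find_ssrs(seq: str, min_repeats: int = 5, max_unit: int = 6) -> List[Tuple[int, str, int]]:
    """Find perfect tandem repeats in a DNA sequence.

    Returns (start_position, repeat_unit, repeat_count) for each hit.
    Thresholds are illustrative, not those of the study.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Lazy unit match so the shortest repeating unit is reported,
    # followed by at least (min_repeats - 1) further copies of it.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits
```

For example, `find_ssrs("GGCATATATATATGGC")` reports one dinucleotide repeat, `AT` repeated five times starting at position 3. Dedicated SSR tools typically apply different minimum repeat counts per unit length, which this sketch does not attempt.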

Journal ArticleDOI
TL;DR: HTRIdb is a powerful user-friendly tool from which human experimentally validated TF-TG interactions can be easily extracted and used to construct transcriptional regulation interaction networks enabling researchers to decipher the regulation of biological processes.
Abstract: The modeling of interactions among transcription factors (TFs) and their respective target genes (TGs) into transcriptional regulatory networks is important for the complete understanding of regulation of biological processes. In the case of experimentally verified human TF-TG interactions, there is no database at present that explicitly provides such information even though many databases containing human TF-TG interaction data have been available. In an effort to provide researchers with a repository of experimentally verified human TF-TG interactions from which such interactions can be directly extracted, we present here the Human Transcriptional Regulation Interactions database (HTRIdb). The HTRIdb is an open-access database that can be searched via a user-friendly web interface and the retrieved TF-TG interactions data and the associated protein-protein interactions can be downloaded or interactively visualized as a network through the web version of the popular Cytoscape visualization tool, the Cytoscape Web. Moreover, users can improve the database quality by uploading their own interactions and indicating inconsistencies in the data. So far, HTRIdb has been populated with 284 TFs that regulate 18302 genes, totaling 51871 TF-TG interactions. HTRIdb is freely available at http://www.lbbc.ibb.unesp.br/htri. HTRIdb is a powerful user-friendly tool from which human experimentally validated TF-TG interactions can be easily extracted and used to construct transcriptional regulation interaction networks enabling researchers to decipher the regulation of biological processes.

Journal ArticleDOI
TL;DR: The findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture.
Abstract: High-throughput sequencing of cDNA libraries (RNA-Seq) has proven to be a highly effective approach for studying bacterial transcriptomes. A central challenge in designing RNA-Seq-based experiments is estimating a priori the number of reads per sample needed to detect and quantify thousands of individual transcripts with a large dynamic range of abundance. We have conducted a systematic examination of how changes in the number of RNA-Seq reads per sample influences both profiling of a single bacterial transcriptome and the comparison of gene expression among samples. Our findings suggest that the number of reads typically produced in a single lane of the Illumina HiSeq sequencer far exceeds the number needed to saturate the annotated transcriptomes of diverse bacteria growing in monoculture. Moreover, as sequencing depth increases, so too does the detection of cDNAs that likely correspond to spurious transcripts or genomic DNA contamination. Finally, even when dozens of barcoded individual cDNA libraries are sequenced in a single lane, the vast majority of transcripts in each sample can be detected and numerous genes differentially expressed between samples can be identified. Our analysis provides a guide for the many researchers seeking to determine the appropriate sequencing depth for RNA-Seq-based studies of diverse bacterial species.
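The saturation question raised in this abstract (how many reads are needed before additional reads stop revealing new transcripts) is often explored by subsampling. The sketch below is a generic rarefaction example, not the authors' pipeline; the input representation (one gene label per mapped read) and the names `genes_detected` and `rarefaction_curve` are assumptions.

```python
import random
from collections import Counter
from typing import List, Tuple


def genes_detected(read_genes: List[str], n_reads: int, min_reads: int = 1, seed: int = 0) -> int:
    """Count genes supported by at least `min_reads` reads in a random subsample.

    `read_genes` holds one gene label per mapped read (hypothetical input format).
    """
    rng = random.Random(seed)
    subsample = rng.sample(read_genes, n_reads)  # sampling without replacement
    counts = Counter(subsample)
    return sum(1 for c in counts.values() if c >= min_reads)


def rarefaction_curve(read_genes: List[str], depths: List[int], seed: int = 0) -> List[Tuple[int, int]]:
    """Return (depth, genes detected) pairs for each requested subsampling depth."""
    return [(d, genes_detected(read_genes, d, seed=seed)) for d in depths]
```

When the curve flattens, deeper sequencing adds few newly detected genes, which is the sense of "saturation" used in the abstract.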

Journal ArticleDOI
TL;DR: It is shown that Illumina and Helicos sequences recovered from aDNA extracts could not be aligned to modern reference genomes with the same efficiency unless mapping parameters are optimized for the specific types of errors generated by these platforms and by post-mortem DNA damage.
Abstract: Next-Generation Sequencing has revolutionized our approach to ancient DNA (aDNA) research, by providing complete genomic sequences of ancient individuals and extinct species. However, the recovery of genetic material from long-dead organisms is still complicated by a number of issues, including post-mortem DNA damage and high levels of environmental contamination. Together with error profiles specific to the type of sequencing platforms used, these specificities could limit our ability to map sequencing reads against modern reference genomes and therefore limit our ability to identify endogenous ancient reads, reducing the efficiency of shotgun sequencing aDNA. In this study, we compare different computational methods for improving the accuracy and sensitivity of aDNA sequence identification, based on shotgun sequencing reads recovered from Pleistocene horse extracts using Illumina GAIIx and Helicos Heliscope platforms. We show that the Burrows Wheeler Aligner (BWA), which was developed for mapping undamaged sequencing reads from platforms with low rates of indel-type sequencing errors, can be employed at acceptable run-times by modifying default parameters in a platform-specific manner. We also examine whether trimming likely damaged positions at read ends can increase the recovery of genuine aDNA fragments and whether accurate identification of human contamination can be achieved using a previously suggested strategy based on best hit filtering. We show that combining our different mapping and filtering approaches can increase the number of high-quality endogenous hits recovered by up to 33%. We have shown that Illumina and Helicos sequences recovered from aDNA extracts could not be aligned to modern reference genomes with the same efficiency unless mapping parameters are optimized for the specific types of errors generated by these platforms and by post-mortem DNA damage. 
Our findings have important implications for future aDNA research, as we define mapping guidelines that improve our ability to identify genuine aDNA sequences, which in turn could improve the genotyping accuracy of ancient specimens. Our framework provides a significant improvement to the standard procedures used for characterizing ancient genomes, which is challenged by contamination and often low amounts of DNA material.

Journal ArticleDOI
TL;DR: A global analysis of the peanut transcriptome using RNA-seq assembled a large number of unigenes and developed almost four thousand SSR primers, which will facilitate gene discovery and functional genomic studies of the peanut plant.
Abstract: Background: The peanut (Arachis hypogaea L.) is an important oilseed crop in tropical and subtropical regions of the world. However, little about the molecular biology of the peanut is currently known. Recently, next-generation sequencing technology, termed RNA-seq, has provided a powerful approach for analysing the transcriptome, and for shedding light on the molecular biology of peanut. Results: In this study, we employed RNA-seq to analyse the transcriptomes of the immature seeds of three different peanut varieties with different oil contents. A total of 26.1-27.2 million paired-end reads with lengths of 100 bp were generated from the three varieties and 59,077 unigenes were assembled with an N50 of 823 bp. Based on sequence similarity search with known proteins, a total of 40,100 genes were identified. Among these unigenes, only 8,252 were annotated with 42 gene ontology (GO) functional categories. Moreover, 18,028 unigenes were mapped to 125 pathways by searching against the Kyoto Encyclopedia of Genes and Genomes pathway database (KEGG). In addition, 3,919 microsatellite markers were developed in the unigene library, and 160 PCR primers of SSR loci were used for validation of the amplification and the polymorphism. Conclusion: We completed a successful global analysis of the peanut transcriptome using RNA-seq: a large number of unigenes were assembled, and almost four thousand SSR primers were developed. These data will facilitate gene discovery and functional genomic studies of the peanut plant. In addition, this study provides insight into the complex transcriptome of the peanut and establishes a biotechnological platform for future research.
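The N50 statistic quoted for the assembly (823 bp) follows a standard definition: it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled length. A minimal sketch, with the function name `n50` chosen for illustration:

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and accumulate them; the N50 is
    the length at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # integer test avoids floating-point halves
            return length
    raise AssertionError("unreachable: running total always reaches the sum")
```

For lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` the total is 54; the three longest sequences (10, 9, 8) sum to 27, so the N50 is 8.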

Journal ArticleDOI
TL;DR: The UniGenes were annotated, providing a platform for functional genomic research with this species and the molecular mechanisms for changes in fruit color and taste during ripening were examined.
Abstract: Background: Chinese bayberry (Myrica rubra Sieb. and Zucc.) is an important subtropical fruit crop and an ideal species for fruit quality research due to the rapid and substantial changes that occur during development and ripening, including changes in fruit color and taste. However, research at the molecular level is limited by a lack of sequence data. The present study was designed to obtain transcript sequence data and examine gene expression in developing bayberry fruit based on RNA-Seq and bioinformatic analysis, to provide a foundation for understanding the molecular mechanisms controlling fruit quality changes during ripening.

Journal ArticleDOI
TL;DR: This is the first study to describe the comprehensive bovine milk transcriptome in Holstein cows and reveals that 69% of NCBI Btau 4.0 annotated genes are expressed in bovine milk somatic cells, indicating the ability of milk somatic cells to adapt to different molecular functions according to the biological need of the animal.
Abstract: Background: Cow milk is a complex bioactive fluid consumed by humans beyond infancy. Even though the chemical and physical properties of cow milk are well characterized, very limited research has been done on characterizing the milk transcriptome. This study performs a comprehensive expression profiling of genes expressed in milk somatic cells of transition (day 15), peak (day 90) and late (day 250) lactation Holstein cows by RNA sequencing. Milk samples were collected from Holstein cows at 15, 90 and 250 days of lactation, and RNA was extracted from the pelleted milk cells. Gene expression analysis was conducted by Illumina RNA sequencing. Sequence reads were assembled and analyzed in CLC Genomics Workbench. Gene Ontology (GO) and pathway analysis were performed using the Blast2GO program and GeneGo application of MetaCore program. Results: A total of 16,892 genes were expressed in transition lactation, 19,094 genes were expressed in peak lactation and 18,070 genes were expressed in late lactation. Regardless of the lactation stage approximately 9,000 genes showed ubiquitous expression. Genes encoding caseins, whey proteins and enzymes in lactose synthesis pathway showed higher expression in early lactation. The majority of genes in the fat metabolism pathway had high expression in transition and peak lactation milk. Most of the genes encoding for endogenous proteases and enzymes in ubiquitin-proteasome pathway showed higher expression along the course of lactation. Conclusions: This is the first study to describe the comprehensive bovine milk transcriptome in Holstein cows. The results revealed that 69% of NCBI Btau 4.0 annotated genes are expressed in bovine milk somatic cells. Most of the genes were ubiquitously expressed in all three stages of lactation. However, a fraction of the milk transcriptome has genes devoted to specific functions unique to the lactation stage. 
This indicates the ability of milk somatic cells to adapt to different molecular functions according to the biological need of the animal. This study provides a valuable insight into the biology of lactation in the cow, as well as many avenues for future research on the bovine lactome.

Journal ArticleDOI
TL;DR: The M. phaseolina genome provides a framework of the infection process at the cytological and molecular level which uses a diverse arsenal of enzymatic and toxin tools to destroy the host plants.
Abstract: Macrophomina phaseolina is one of the most destructive necrotrophic fungal pathogens that infect more than 500 plant species throughout the world. It can grow rapidly in infected plants and subsequently produces a large amount of sclerotia that plugs the vessels, resulting in wilting of the plant. We sequenced and assembled ~49 Mb into 15 super-scaffolds covering 92.83% of the M. phaseolina genome. We predict 14,249 open reading frames (ORFs) of which 9,934 are validated by the transcriptome. This phytopathogen has an abundance of secreted oxidases, peroxidases, and hydrolytic enzymes for degrading cell wall polysaccharides and lignocelluloses to penetrate into the host tissue. To overcome the host plant defense response, M. phaseolina encodes a significant number of P450s, MFS type membrane transporters, glycosidases, transposases, and secondary metabolites in comparison to all sequenced ascomycete species. A strikingly distinct set of carbohydrate esterases (CE) are present in M. phaseolina, with the CE9 and CE10 families remarkably higher than any other fungi. The phenotypic microarray data indicates that M. phaseolina can adapt to a wide range of osmotic and pH environments. As a broad host range pathogen, M. phaseolina possesses a large number of pathogen-host interaction genes including those for adhesion, signal transduction, cell wall breakdown, purine biosynthesis, and potent mycotoxin patulin. The M. phaseolina genome provides a framework of the infection process at the cytological and molecular level which uses a diverse arsenal of enzymatic and toxin tools to destroy the host plants. Further understanding of the M. phaseolina genome-based plant-pathogen interactions will be instrumental in designing rational strategies for disease control, essential to ensuring global agricultural crop production and security.

Journal ArticleDOI
TL;DR: VICUNA is a publicly available software tool that enables consensus assembly of ultra-deep sequence derived from diverse viral populations; its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.
Abstract: Extensive genetic diversity in viral populations within infected hosts and the divergence of variants from existing reference genomes impede the analysis of deep viral sequencing data. A de novo population consensus assembly is valuable both as a single linear representation of the population and as a backbone on which intra-host variants can be accurately mapped. The availability of consensus assemblies and robustly mapped variants are crucial to the genetic study of viral disease progression, transmission dynamics, and viral evolution. Existing de novo assembly techniques fail to robustly assemble ultra-deep sequence data from genetically heterogeneous populations such as viruses into full-length genomes due to the presence of extensive genetic variability, contaminants, and variable sequence coverage. We present VICUNA, a de novo assembly algorithm suitable for generating consensus assemblies from genetically heterogeneous populations. We demonstrate its effectiveness on Dengue, Human Immunodeficiency and West Nile viral populations, representing a range of intra-host diversity. Compared to state-of-the-art assemblers designed for haploid or diploid systems, VICUNA recovers full-length consensus and captures insertion/deletion polymorphisms in diverse samples. Final assemblies maintain a high base calling accuracy. VICUNA program is publicly available at: http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/viral-genomics-analysis-software . We developed VICUNA, a publicly available software tool, that enables consensus assembly of ultra-deep sequence derived from diverse viral populations. While VICUNA was developed for the analysis of viral populations, its application to other heterogeneous sequence data sets such as metagenomic or tumor cell population samples may prove beneficial in these fields of research.

Journal ArticleDOI
TL;DR: Analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth.
Abstract: RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression, with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available and can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies affect the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced to as low as 15% without substantial impacts on false positive or true positive rates.
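The simulation approach mentioned in this abstract relies on negative binomial counts, which model the extra variability between biological replicates. The sketch below draws such counts as a gamma-Poisson mixture using only the Python standard library. It is an illustration of the distribution, not the authors' simulation code; the parameterisation (variance = mean + dispersion × mean²) and the names `poisson_draw` and `simulate_nb_counts` are assumptions.

```python
import math
import random
from typing import List


def poisson_draw(rng: random.Random, lam: float) -> int:
    """Draw one Poisson variate with Knuth's multiplication method.

    Adequate for the small-to-moderate means used here; it becomes slow
    and numerically fragile for very large `lam`.
    """
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_nb_counts(mean: float, dispersion: float, n: int, seed: int = 0) -> List[int]:
    """Simulate `n` negative binomial read counts for one gene.

    Each replicate gets its own gamma-distributed expression rate
    (shape 1/dispersion, scale mean*dispersion), then a Poisson count.
    """
    if mean <= 0 or dispersion <= 0:
        raise ValueError("mean and dispersion must be positive")
    rng = random.Random(seed)
    shape = 1.0 / dispersion
    scale = mean * dispersion
    return [poisson_draw(rng, rng.gammavariate(shape, scale)) for _ in range(n)]
```

Lowering `mean` mimics reduced sequencing depth for a gene, while increasing `n` mimics adding replicates, which are the two design choices the study trades off.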

Journal ArticleDOI
TL;DR: The complete genome of P. digitatum, the first of a phytopathogenic Penicillium species, is a valuable tool for understanding the virulence mechanisms and host-specificity of this economically important pest.
Abstract: Penicillium digitatum is a fungal necrotroph causing a common citrus postharvest disease known as green mold. In order to gain insight into the genetic bases of its virulence mechanisms and its high degree of host-specificity, the genomes of two P. digitatum strains that differ in their antifungal resistance traits have been sequenced and compared with those of 28 other Pezizomycotina. The two sequenced genomes are highly similar, but important differences between them include the presence of a unique gene cluster in the resistant strain, and mutations previously shown to confer fungicide resistance. The two strains, which were isolated in Spain, and another isolated in China have identical mitochondrial genome sequences suggesting a recent worldwide expansion of the species. Comparison with the closely-related but non-phytopathogenic P. chrysogenum reveals a much smaller gene content in P. digitatum, consistent with a more specialized lifestyle. We show that large regions of the P. chrysogenum genome, including entire supercontigs, are absent from P. digitatum, and that this is the result of large gene family expansions rather than acquisition through horizontal gene transfer. Our analysis of the P. digitatum genome is indicative of heterothallic sexual reproduction and reveals the molecular basis for the inability of this species to assimilate nitrate or produce the metabolites patulin and penicillin. Finally, we identify the predicted secretome, which provides a first approximation to the protein repertoire used during invasive growth. The complete genome of P. digitatum, the first of a phytopathogenic Penicillium species, is a valuable tool for understanding the virulence mechanisms and host-specificity of this economically important pest.

Journal ArticleDOI
TL;DR: The results of comparing a large and diverse E. coli dataset support the theory that reliable and good resolution phylogenies can be inferred from the core-genome and suggest that the resolution at the isolate level may, subsequently, be improved by targeting more variable genes.
Abstract: Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determination of molecular clocks and for improved typing techniques. We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree based on alignment, and a pan-genome tree based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence and divides the E. coli strains into the observed MLST type clades and also separates defined phylotypes. The results of comparing a large and diverse E. coli dataset support the theory that reliable and good resolution phylogenies can be inferred from the core-genome. The results further suggest that the resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.
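The core, 'soft core' and pan-genome counts in this abstract are all summaries of a gene presence/absence table. A minimal sketch of that bookkeeping follows; the input format (gene cluster name mapped to the set of genomes containing it) and the function name `summarize_pangenome` are assumptions, while the 95% soft-core threshold comes from the abstract.

```python
from typing import Dict, Set


def summarize_pangenome(
    presence: Dict[str, Set[str]], n_genomes: int, soft_percent: int = 95
) -> Dict[str, int]:
    """Count pan, soft-core and strict-core gene clusters.

    `presence` maps each gene cluster to the set of genome identifiers
    in which it occurs (hypothetical input format).
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    pan = len(presence)
    # Strict core: present in every genome.
    core = sum(1 for genomes in presence.values() if len(genomes) == n_genomes)
    # Soft core: present in at least soft_percent % of genomes.
    # Integer arithmetic avoids floating-point rounding at the threshold.
    soft_core = sum(
        1
        for genomes in presence.values()
        if len(genomes) * 100 >= soft_percent * n_genomes
    )
    return {"pan": pan, "soft_core": soft_core, "core": core}
```

Because draft-quality genomes can miss genes that are really present, the soft-core count is less sensitive to assembly gaps than the strict core, which is the rationale the abstract gives for preferring it.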

Journal ArticleDOI
TL;DR: It is hypothesized that the associated genes originated from an ancestral gene, encoding a secreted ribonuclease, duplicated successively by repetitive DNA-driven processes and diversified during the evolution of the grass and cereal powdery mildew lineage.
Abstract: Protein effectors of pathogenicity are instrumental in modulating host immunity and disease resistance. The powdery mildew pathogen of grasses Blumeria graminis causes one of the most important diseases of cereal crops. B. graminis is an obligate biotrophic pathogen and as such has an absolute requirement to suppress or avoid host immunity if it is to survive and cause disease. Here we characterise a superfamily predicted to be the full complement of Candidates for Secreted Effector Proteins (CSEPs) in the fungal barley powdery mildew parasite B. graminis f.sp. hordei. The 491 genes encoding these proteins constitute over 7% of this pathogen’s annotated genes and most were grouped into 72 families of up to 59 members. They were predominantly expressed in the intracellular feeding structures called haustoria, and proteins specifically associated with the haustoria were identified by large-scale mass spectrometry-based proteomics. There are two major types of effector families: one comprises shorter proteins (100–150 amino acids), with a high relative expression level in the haustoria and evidence of extensive diversifying selection between paralogs; the second type consists of longer proteins (300–400 amino acids), with lower levels of differential expression and evidence of purifying selection between paralogs. An analysis of the predicted protein structures underscores their overall similarity to known fungal effectors, but also highlights unexpected structural affinities to ribonucleases throughout the entire effector super-family. Candidate effector genes belonging to the same family are loosely clustered in the genome and are associated with repetitive DNA derived from retro-transposons. We employed the full complement of genomic, transcriptomic and proteomic analyses as well as structural prediction methods to identify and characterize the members of the CSEPs superfamily in B. graminis f.sp. hordei. 
Based on relative intron position and the distribution of CSEPs with a ribonuclease-like domain in the phylogenetic tree we hypothesize that the associated genes originated from an ancestral gene, encoding a secreted ribonuclease, duplicated successively by repetitive DNA-driven processes and diversified during the evolution of the grass and cereal powdery mildew lineage.

Journal ArticleDOI
TL;DR: Strong support is found for the classical model of genetic variants regulating methylation, which in turn regulates gene expression, and it is shown that, although the methylation and expression modules differ, they are highly correlated.
Abstract: The predominant model for regulation of gene expression through DNA methylation is an inverse association in which increased methylation results in decreased gene expression levels. However, recent studies suggest that the relationship between genetic variation, DNA methylation and expression is more complex. Systems genetic approaches for examining relationships between gene expression and methylation array data were used to find both negative and positive associations between these levels. A weighted correlation network analysis revealed that i) both transcriptome and methylome are organized in modules, ii) co-expression modules are generally not preserved in the methylation data and vice-versa, and iii) highly significant correlations exist between co-expression and co-methylation modules, suggesting the existence of factors that affect expression and methylation of different modules (i.e., trans effects at the level of modules). We observed that methylation probes associated with expression in cis were more likely to be located outside CpG islands, whereas specificity for CpG island shores was present when methylation, associated with expression, was under local genetic control. A structural equation model based analysis found strong support in particular for a traditional causal model in which gene expression is regulated by genetic variation via DNA methylation instead of gene expression affecting DNA methylation levels. Our results provide new insights into the complex mechanisms between genetic markers, epigenetic mechanisms and gene expression. We find strong support for the classical model of genetic variants regulating methylation, which in turn regulates gene expression. Moreover we show that, although the methylation and expression modules differ, they are highly correlated.