
Showing papers on "Sequence analysis" published in 2010


Journal ArticleDOI
TL;DR: A new method and corresponding software tool, PolyPhen-2, is presented; it differs from the earlier tool PolyPhen in its set of predictive features, alignment pipeline, and method of classification, and its performance, as shown by receiver operating characteristic curves, was consistently superior.
Abstract: To the Editor: Applications of rapidly advancing sequencing technologies exacerbate the need to interpret individual sequence variants. Sequencing of phenotyped clinical subjects will soon become a method of choice in studies of the genetic causes of Mendelian and complex diseases. New exon capture techniques will direct sequencing efforts towards the most informative and easily interpretable protein-coding fraction of the genome. Thus, the demand for computational predictions of the impact of protein sequence variants will continue to grow. Here we present a new method and the corresponding software tool, PolyPhen-2 (http://genetics.bwh.harvard.edu/pph2/), which is different from the earlier tool PolyPhen¹ in the set of predictive features, alignment pipeline, and the method of classification (Fig. 1a). PolyPhen-2 uses eight sequence-based and three structure-based predictive features (Supplementary Table 1) which were selected automatically by an iterative greedy algorithm (Supplementary Methods). The majority of these features involve comparison of a property of the wild-type (ancestral, normal) allele and the corresponding property of the mutant (derived, disease-causing) allele, which together define an amino acid replacement. The most informative features characterize how well the two human alleles fit into the pattern of amino acid replacements within the multiple sequence alignment of homologous proteins, how distant the protein harboring the first deviation from the human wild-type allele is from the human protein, and whether the mutant allele originated at a hypermutable site². The alignment pipeline selects the set of homologous sequences for the analysis using a clustering algorithm and then constructs and refines their multiple alignment (Supplementary Fig. 1). The functional significance of an allele replacement is predicted from its individual features (Supplementary Figs. 2–4) by a Naive Bayes classifier (Supplementary Methods).
[Figure 1: PolyPhen-2 pipeline and prediction accuracy. (a) Overview of the algorithm. (b) Receiver operating characteristic (ROC) curves for predictions made by PolyPhen-2 using five-fold cross-validation on HumDiv (red) and HumVar³ (light green). UniRef100 (solid ...]
We used two pairs of datasets to train and test PolyPhen-2. We compiled the first pair, HumDiv, from all 3,155 damaging alleles with known effects on the molecular function causing human Mendelian diseases, present in the UniProt database, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be non-damaging (Supplementary Methods). The second pair, HumVar³, consists of all the 13,032 human disease-causing mutations from UniProt, together with 8,946 human nsSNPs without annotated involvement in disease, which were treated as non-damaging. We found that PolyPhen-2 performance, as presented by its receiver operating characteristic curves, was consistently superior to that of PolyPhen (Fig. 1b) and it also compared favorably with the three other popular prediction tools⁴⁻⁶ (Fig. 1c). For a false positive rate of 20%, PolyPhen-2 achieves true positive rates of 92% and 73% on HumDiv and HumVar, respectively (Supplementary Table 2). One reason for the lower accuracy of predictions on HumVar is that nsSNPs assumed to be non-damaging in HumVar contain a sizable fraction of mildly deleterious alleles. In contrast, most of the amino acid replacements assumed non-damaging in HumDiv must be close to selective neutrality. Because alleles that are even mildly but unconditionally deleterious cannot be fixed in the evolving lineage, no method based on comparative sequence analysis is ideal for discriminating between drastically and mildly deleterious mutations, which are assigned to the opposite categories in HumVar. Another reason is that HumDiv uses an extra criterion to avoid possible erroneous annotations of damaging mutations.
For a mutation, PolyPhen-2 calculates the Naive Bayes posterior probability that this mutation is damaging and reports estimates of the false positive rate (the chance that the mutation is classified as damaging when it is in fact non-damaging) and the true positive rate (the chance that the mutation is classified as damaging when it is indeed damaging). A mutation is also appraised qualitatively, as benign, possibly damaging, or probably damaging (Supplementary Methods). The user can choose between HumDiv- and HumVar-trained PolyPhen-2. Diagnostics of Mendelian diseases requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles. Thus, HumVar-trained PolyPhen-2 should be used for this task. In contrast, HumDiv-trained PolyPhen-2 should be used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.
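As a rough illustration of the classification step described in this abstract, the sketch below computes a Naive Bayes posterior probability that a variant is damaging from two binary features. The feature names, likelihood tables, and prior are invented for illustration only; they are not PolyPhen-2's actual features or trained parameters.

```python
import math

# Hypothetical likelihoods P(feature value | class). Illustrative numbers only,
# not taken from PolyPhen-2.
LIKELIHOODS = {
    "conserved_site": {
        "damaging": {True: 0.8, False: 0.2},
        "benign": {True: 0.3, False: 0.7},
    },
    "buried_residue": {
        "damaging": {True: 0.6, False: 0.4},
        "benign": {True: 0.25, False: 0.75},
    },
}
PRIOR = {"damaging": 0.5, "benign": 0.5}  # assumed equal prior


def posterior_damaging(features):
    """Return P(damaging | features), treating features as independent given the class."""
    log_score = {}
    for cls, prior in PRIOR.items():
        total = math.log(prior)
        for name, value in features.items():
            total += math.log(LIKELIHOODS[name][cls][value])
        log_score[cls] = total
    # Normalize in log space to avoid underflow when many features are used.
    top = max(log_score.values())
    weights = {cls: math.exp(score - top) for cls, score in log_score.items()}
    return weights["damaging"] / sum(weights.values())
```

With both hypothetical features present, the posterior comes out at roughly 0.86. A qualitative label such as "probably damaging" would then follow from thresholds on this value, which are not given in the abstract.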

11,571 citations


Journal ArticleDOI
02 Jul 2010-Science
TL;DR: The design, synthesis, and assembly of the 1.08–mega–base pair Mycoplasma mycoides JCVI-syn1.0 genome starting from digitized genome sequence information and its transplantation into a M. capricolum recipient cell to create new cells that are controlled only by the synthetic chromosome are reported.
Abstract: We report the design, synthesis, and assembly of the 1.08-mega-base pair Mycoplasma mycoides JCVI-syn1.0 genome starting from digitized genome sequence information and its transplantation into a M. capricolum recipient cell to create new M. mycoides cells that are controlled only by the synthetic chromosome. The only DNA in the cells is the designed synthetic DNA sequence, including "watermark" sequences and other designed gene deletions and polymorphisms, and mutations acquired during the building process. The new cells have expected phenotypic properties and are capable of continuous self-replication.

2,256 citations


Journal ArticleDOI
TL;DR: The Bacterial Isolate Genome Sequence Database (BIGSDB) represents a freely available resource that will assist the broader community in the elucidation of the structure and function of bacteria by means of a population genomics approach.
Abstract: The opportunities for bacterial population genomics that are being realised by the application of parallel nucleotide sequencing require novel bioinformatics platforms. These must be capable of the storage, retrieval, and analysis of linked phenotypic and genotypic information in an accessible, scalable and computationally efficient manner. The Bacterial Isolate Genome Sequence Database (BIGSDB) is a scalable, open source, web-accessible database system that meets these needs, enabling phenotype and sequence data, which can range from a single sequence read to whole genome data, to be efficiently linked for a limitless number of bacterial specimens. The system builds on the widely used mlstdbNet software, developed for the storage and distribution of multilocus sequence typing (MLST) data, and incorporates the capacity to define and identify any number of loci and genetic variants at those loci within the stored nucleotide sequences. These loci can be further organised into 'schemes' for isolate characterisation or for evolutionary or functional analyses. Isolates and loci can be indexed by multiple names and any number of alternative schemes can be accommodated, enabling cross-referencing of different studies and approaches. LIMS functionality of the software enables linkage to and organisation of laboratory samples. The data are easily linked to external databases and fine-grained authentication of access permits multiple users to participate in community annotation by setting up or contributing to different schemes within the database. Some of the applications of BIGSDB are illustrated with the genera Neisseria and Streptococcus. The BIGSDB source code and documentation are available at http://pubmlst.org/software/database/bigsdb/. Genomic data can be used to characterise bacterial isolates in many different ways but it can also be efficiently exploited for evolutionary or functional studies. BIGSDB represents a freely available resource that will assist the broader community in the elucidation of the structure and function of bacteria by means of a population genomics approach.

1,943 citations


Journal ArticleDOI
TL;DR: TranslatorX is presented, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations, with a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments.
Abstract: We present TranslatorX, a web server designed to align protein-coding nucleotide sequences based on their corresponding amino acid translations. Many comparisons between biological sequences (nucleic acids and proteins) involve the construction of multiple alignments. Alignments represent a statement regarding the homology between individual nucleotides or amino acids within homologous genes. As protein-coding DNA sequences evolve as triplets of nucleotides (codons) and it is known that sequence similarity degrades more rapidly at the DNA than at the amino acid level, alignments are generally more accurate when based on amino acids than on their corresponding nucleotides. TranslatorX novelties include: (i) use of all documented genetic codes and the possibility of assigning different genetic codes for each sequence; (ii) a battery of different multiple alignment programs; (iii) translation of ambiguous codons when possible; (iv) an innovative criterion to clean nucleotide alignments with GBlocks based on protein information; and (v) a rich output, including Jalview-powered graphical visualization of the alignments, codon-based alignments coloured according to the corresponding amino acids, measures of compositional bias and first, second and third codon position specific alignments. The TranslatorX server is freely available at http://translatorx.co.uk.
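The core idea of aligning nucleotides via their amino acid translations can be sketched as follows: once the protein sequences are aligned, each unaligned coding sequence is threaded back onto its aligned translation codon by codon. This is a minimal sketch, not TranslatorX's implementation; the function name `back_translate_alignment` is hypothetical, and the sketch assumes the protein alignment has already been computed and that the coding sequence contains no stop codon or frameshift.

```python
def back_translate_alignment(aligned_protein, nucleotide_seq):
    """Thread an unaligned coding sequence onto its gapped protein translation.

    aligned_protein: protein sequence with '-' for alignment gaps.
    nucleotide_seq: the ungapped coding sequence that encodes it.
    Returns the nucleotide sequence with '---' inserted at each protein gap.
    """
    if len(nucleotide_seq) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of three")
    codons = [nucleotide_seq[i:i + 3] for i in range(0, len(nucleotide_seq), 3)]
    residue_count = len(aligned_protein.replace("-", ""))
    if len(codons) != residue_count:
        raise ValueError("codon count does not match the number of residues")

    remaining = iter(codons)
    aligned_codons = []
    for residue in aligned_protein:
        # A gap in the protein alignment becomes a three-nucleotide gap.
        aligned_codons.append("---" if residue == "-" else next(remaining))
    return "".join(aligned_codons)
```

For example, threading `ATGAAA` onto the aligned protein `M-K` gives `ATG---AAA`, which keeps every codon intact in the nucleotide alignment.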

1,186 citations


Journal ArticleDOI
TL;DR: An algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities is described and its accuracy assessed; applying it could add several thousand new genes to existing annotations of several human and mouse gut metagenomes.
Abstract: We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. The original version of the method was proposed in 1999 and has been used since for (i) reconstructing the codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With the advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.

1,178 citations


Journal ArticleDOI
30 Apr 2010-Science
TL;DR: Family-based genome analysis enabled us to narrow the candidate genes for both of these Mendelian disorders to only four and demonstrate the value of complete genome sequencing in families.
Abstract: We analyzed the whole-genome sequences of a family of four, consisting of two siblings and their parents. Family-based sequencing allowed us to delineate recombination sites precisely, identify 70% of the sequencing errors (resulting in >99.999% accuracy), and identify very rare single-nucleotide polymorphisms. We also directly estimated a human intergeneration mutation rate of approximately 1.1 × 10⁻⁸ per position per haploid genome. Both offspring in this family have two recessive disorders: Miller syndrome, for which the gene was concurrently identified, and primary ciliary dyskinesia, for which causative genes have been previously identified. Family-based genome analysis enabled us to narrow the candidate genes for both of these Mendelian disorders to only four. Our results demonstrate the value of complete genome sequencing in families.

1,064 citations


Journal ArticleDOI
TL;DR: Comparative analysis of the genomes of 21 strains representative of the global diversity and six major lineages of the M. tuberculosis complex (MTBC) showed, as expected, that essential genes in MTBC were more evolutionarily conserved than nonessential genes.
Abstract: Mycobacterium tuberculosis is an obligate human pathogen capable of persisting in individual hosts for decades. We sequenced the genomes of 21 strains representative of the global diversity and six major lineages of the M. tuberculosis complex (MTBC) at 40- to 90-fold coverage using Illumina next-generation DNA sequencing. We constructed a genome-wide phylogeny based on these genome sequences. Comparative analyses of the sequences showed, as expected, that essential genes in MTBC were more evolutionarily conserved than nonessential genes. Notably, however, most of the 491 experimentally confirmed human T cell epitopes showed little sequence variation and had a lower ratio of nonsynonymous to synonymous changes than seen in essential and nonessential genes. We confirmed these findings in an additional data set consisting of 16 antigens in 99 MTBC strains. These findings are consistent with strong purifying selection acting on these epitopes, implying that MTBC might benefit from recognition by human T cells.

621 citations


Journal ArticleDOI
TL;DR: This RNA-Seq atlas extends the analyses of previous gene expression atlases performed using Affymetrix GeneChip technology and provides an example of new methods to accommodate the increase in transcriptome data obtained from next generation sequencing.
Abstract: Next generation sequencing is transforming our understanding of transcriptomes. It can determine the expression level of transcripts with a dynamic range of over six orders of magnitude from multiple tissues, developmental stages or conditions. Patterns of gene expression provide insight into functions of genes with unknown annotation. The RNA Seq-Atlas presented here provides a record of high-resolution gene expression in a set of fourteen diverse tissues. Hierarchical clustering of transcriptional profiles for these tissues suggests three clades with similar profiles: aerial, underground and seed tissues. We also investigate the relationship between gene structure and gene expression and find a correlation between gene length and expression. Additionally, we find dramatic tissue-specific gene expression of both the most highly-expressed genes and the genes specific to legumes in seed development and nodule tissues. Analysis of the gene expression profiles of over 2,000 genes with preferential gene expression in seed suggests there are more than 177 genes with functional roles that are involved in the economically important seed filling process. Finally, the Seq-atlas also provides a means of evaluating existing gene model annotations for the Glycine max genome. This RNA-Seq atlas extends the analyses of previous gene expression atlases performed using Affymetrix GeneChip technology and provides an example of new methods to accommodate the increase in transcriptome data obtained from next generation sequencing. Data contained within this RNA-Seq atlas of Glycine max can be explored at http://www.soybase.org/soyseq .

615 citations


Journal ArticleDOI
TL;DR: This work tracked the performance of >600,000 variants of a human WW domain after three and six rounds of selection by phage display for binding to its peptide ligand, providing a general means for understanding how protein function relates to sequence.
Abstract: We present a large-scale approach to investigate the functional consequences of sequence variation in a protein. The approach entails the display of hundreds of thousands of protein variants, moderate selection for activity, and high throughput DNA sequencing to quantify the performance of each variant. Using this strategy, we tracked the performance of >600,000 variants of a human WW domain after three and six rounds of selection by phage display for binding to its peptide ligand. Binding properties of these variants defined a high-resolution map of mutational preference across the WW domain; each position possessed unique features that could not be captured by a few representative mutations. Our approach could be applied to many in vitro or in vivo protein assays, providing a general means for understanding how protein function relates to sequence.
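One common way to turn sequencing counts into a per-variant performance score is a log enrichment ratio between the selected and input libraries. The abstract does not give the authors' exact scoring formula, so the sketch below is an assumption: the function name, the pseudocount, and the use of log2 are all illustrative choices.

```python
import math


def log2_enrichment(input_counts, selected_counts, pseudocount=0.5):
    """Score each variant by log2(frequency after selection / frequency before).

    input_counts, selected_counts: dicts mapping variant -> read count.
    pseudocount: assumed smoothing value so variants lost in selection
    still receive a finite score.
    """
    input_total = sum(input_counts.values())
    selected_total = sum(selected_counts.values())
    scores = {}
    for variant, input_count in input_counts.items():
        selected_count = selected_counts.get(variant, 0)
        input_freq = (input_count + pseudocount) / input_total
        selected_freq = (selected_count + pseudocount) / selected_total
        scores[variant] = math.log2(selected_freq / input_freq)
    return scores
```

Positive scores mark variants that became more frequent under selection, and negative scores mark depleted ones. Comparing scores after three and six rounds would show whether a variant's advantage persists.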

519 citations


Journal ArticleDOI
TL;DR: A sequencing study of expressed genes from lodgepole pine, covering their assembly, annotation, and potential for molecular marker development to support population and association genetic studies, illustrates the utility of next generation sequencing as a basis for marker development and population genomics in non-model species.
Abstract: Massively parallel sequencing of cDNA is now an efficient route for generating enormous sequence collections that represent expressed genes. This approach provides a valuable starting point for characterizing functional genetic variation in non-model organisms, especially where whole genome sequencing efforts are currently cost and time prohibitive. The large and complex genomes of pines (Pinus spp.) have hindered the development of genomic resources, despite the ecological and economical importance of the group. While most genomic studies have focused on a single species (P. taeda), genomic level resources for other pines are insufficiently developed to facilitate ecological genomic research. Lodgepole pine (P. contorta) is an ecologically important foundation species of montane forest ecosystems and exhibits substantial adaptive variation across its range in western North America. Here we describe a sequencing study of expressed genes from P. contorta, including their assembly and annotation, and their potential for molecular marker development to support population and association genetic studies. We obtained 586,732 sequencing reads from a 454 GS XLR70 Titanium pyrosequencer (mean length: 306 base pairs). A combination of reference-based and de novo assemblies yielded 63,657 contigs, with 239,793 reads remaining as singletons. Based on sequence similarity with known proteins, these sequences represent approximately 17,000 unique genes, many of which are well covered by contig sequences. This sequence collection also included a surprisingly large number of retrotransposon sequences, suggesting that they are highly transcriptionally active in the tissues we sampled. We located and characterized thousands of simple sequence repeats and single nucleotide polymorphisms as potential molecular markers in our assembled and annotated sequences. 
High quality PCR primers were designed for a substantial number of the SSR loci, and a large number of these were amplified successfully in initial screening. This sequence collection represents a major genomic resource for P. contorta, and the large number of genetic markers characterized should contribute to future research in this and other pines. Our results illustrate the utility of next generation sequencing as a basis for marker development and population genomics in non-model species.

420 citations


Journal Article
TL;DR: MicrobesOnline is a community resource for comparative and functional genome analysis that includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree.
Abstract: Since 2003, MicrobesOnline (http://www.microbesonline.org) has been providing a community resource for comparative and functional genome analysis. The portal includes over 1000 complete genomes of bacteria, archaea and fungi and thousands of expression microarrays from diverse organisms ranging from model organisms such as Escherichia coli and Saccharomyces cerevisiae to environmental microbes such as Desulfovibrio vulgaris and Shewanella oneidensis. To assist in annotating genes and in reconstructing their evolutionary history, MicrobesOnline includes a comparative genome browser based on phylogenetic trees for every gene family as well as a species tree. To identify co-regulated genes, MicrobesOnline can search for genes based on their expression profile, and provides tools for identifying regulatory motifs and seeing if they are conserved. MicrobesOnline also includes fast phylogenetic profile searches, comparative views of metabolic pathways, operon predictions, a workbench for sequence analysis and integration with RegTransBase and other microbial genome resources. The next update of MicrobesOnline will contain significant new functionality, including comparative analysis of metagenomic sequence data. Programmatic access to the database, along with source code and documentation, is available at http://microbesonline.org/programmers.html.

Journal ArticleDOI
TL;DR: To identify small RNA targets in rice, the authors applied the degradome sequencing approach, which globally identifies the remnants of small RNA-directed target cleavage by sequencing the 5' ends of uncapped RNAs.
Abstract: MicroRNA (miRNA)-guided target RNA expression is vital for a wide variety of biological processes in eukaryotes. Currently, miRBase (version 13) lists 142 and 353 miRNAs from Arabidopsis and rice (Oryza sativa), respectively. The integration of miRNAs in diverse biological networks relies upon the confirmation of their RNA targets. In contrast with the well-characterized miRNA targets that are cleaved in Arabidopsis, only a few such targets have been confirmed in rice. To identify small RNA targets in rice, we applied the 'degradome sequencing' approach, which globally identifies the remnants of small RNA-directed target cleavage by sequencing the 5' ends of uncapped RNAs. One hundred and sixty targets of 53 miRNA families (24 conserved and 29 rice-specific) and five targets of TAS3-small interfering RNAs (siRNAs) were identified. Surprisingly, an additional conserved target for miR398, which has not been reported so far, has been validated. Besides conserved homologous transcripts, 23 non-conserved genes for nine conserved miRNAs and 56 genes for 29 rice-specific miRNAs were also identified as targets. Besides miRNA targets, the rice degradome contained fragments derived from MIRNA precursors. A closer inspection of these fragments revealed a unique pattern distinct from siRNA-producing loci. This attribute can serve as one of the ancillary criteria for separating miRNAs from siRNAs in plants.

Journal ArticleDOI
TL;DR: Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
Abstract: The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization. We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families. Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
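The idea of partitioning reads by similarity can be illustrated with the simplest possible graph clustering: reads are nodes, detected similarities are edges, and each connected component is treated as one cluster. This is a deliberate simplification and not the partitioning method of the paper, which the abstract does not specify; the function names and the input format (a list of read-ID pairs already judged similar) are assumptions.

```python
from collections import defaultdict


def cluster_reads(read_ids, similar_pairs):
    """Group reads into connected components of the similarity graph (union-find)."""
    parent = {read: read for read in read_ids}

    def find(read):
        # Follow parent links to the root, compressing the path as we go.
        while parent[read] != read:
            parent[read] = parent[parent[read]]
            read = parent[read]
        return read

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for read in read_ids:
        clusters[find(read)].append(read)
    # Largest clusters first: these correspond to the most abundant repeats.
    return sorted(clusters.values(), key=len, reverse=True)


def genome_proportions(clusters, total_reads):
    """Estimate each cluster's share of the genome as its share of the reads."""
    return [len(cluster) / total_reads for cluster in clusters]
```

Under low-pass, unbiased sampling, the fraction of reads falling in a cluster approximates the fraction of the genome occupied by that repeat family, which is how cluster sizes inform repeat proportions.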

Journal ArticleDOI
26 Oct 2010-PLOS ONE
TL;DR: A low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region is developed, showing that the short reads are sufficient to assign organisms to the genus or species level in most cases.
Abstract: We developed a low-cost, high-throughput microbiome profiling method that uses combinatorial sequence tags attached to PCR primers that amplify the rRNA V6 region. Amplified PCR products are sequenced using an Illumina paired-end protocol to generate millions of overlapping reads. Combinatorial sequence tagging can be used to examine hundreds of samples with far fewer primers than is required when sequence tags are incorporated at only a single end. The number of reads generated permitted saturating or near-saturating analysis of samples of the vaginal microbiome. The large number of reads allowed an in-depth analysis of errors, and we found that PCR-induced errors composed the vast majority of non-organism derived species variants, an observation that has significant implications for sequence clustering of similar high-throughput data. We show that the short reads are sufficient to assign organisms to the genus or species level in most cases. We suggest that this method will be useful for the deep sequencing of any short nucleotide region that is taxonomically informative; these include the V3, V5 regions of the bacterial 16S rRNA genes and the eukaryotic V9 region that is gaining popularity for sampling protist diversity.
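The saving from combinatorial tagging is simple arithmetic: with tags on both primers, every forward/reverse tag pair identifies a sample, so f forward tags and r reverse tags cover f × r samples using only f + r tagged primers. The sketch below works this out; the function names and tag labels are hypothetical and the tag sequences themselves are not modeled.

```python
import math
from itertools import product


def min_tagged_primers(n_samples):
    """Smallest f + r such that f forward tags x r reverse tags cover n_samples."""
    return min(
        forward + math.ceil(n_samples / forward)
        for forward in range(1, n_samples + 1)
    )


def assign_tag_pairs(samples, forward_tags, reverse_tags):
    """Give each sample a unique (forward tag, reverse tag) combination."""
    pairs = list(product(forward_tags, reverse_tags))
    if len(samples) > len(pairs):
        raise ValueError("not enough tag combinations for the samples")
    return dict(zip(samples, pairs))
```

For 100 samples, tagging only one end needs 100 distinct tagged primers, whereas a 10 × 10 combinatorial design needs 20.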

Journal ArticleDOI
TL;DR: This study redefines and reclassifies the domains of PfEMP1 from seven genomes, and hopes this comprehensive categorization will provide a platform for future studies on var/PfEMP1 expression and function.
Abstract: The var gene encoded hyper-variable Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1) family mediates cytoadhesion of infected erythrocytes to human endothelium. Antibodies blocking cytoadhesion are important mediators of malaria immunity acquired by endemic populations. The development of a PfEMP1 based vaccine mimicking natural acquired immunity depends on a thorough understanding of the evolved PfEMP1 diversity, balancing antigenic variation against conserved receptor binding affinities. This study redefines and reclassifies the domains of PfEMP1 from seven genomes. Analysis of domains in 399 different PfEMP1 sequences allowed identification of several novel domain classes, and a high degree of PfEMP1 domain compositional order, including conserved domain cassettes not always associated with the established group A–E division of PfEMP1. A novel iterative homology block (HB) detection method was applied, allowing identification of 628 conserved minimal PfEMP1 building blocks, describing on average 83% of a PfEMP1 sequence. Using the HBs, similarities between domain classes were determined, and Duffy binding-like (DBL) domain subclasses were found in many cases to be hybrids of major domain classes. Related to this, a recombination hotspot was uncovered between DBL subdomains S2 and S3. The VarDom server is introduced, from which information on domain classes and homology blocks can be retrieved, and new sequences can be classified. Several conserved sequence elements were found, including: (1) residues conserved in all DBL domains predicted to interact and hold together the three DBL subdomains, (2) potential integrin binding sites in DBLα domains, (3) an acylation motif conserved in group A var genes suggesting N-terminal N-myristoylation, (4) PfEMP1 inter-domain regions proposed to be elastic disordered structures, and (5) several conserved predicted phosphorylation sites. 
Ideally, this comprehensive categorization of PfEMP1 will provide a platform for future studies on var/PfEMP1 expression and function.

Journal ArticleDOI
TL;DR: The ability to identify regulatory protein binding sites de novo, determine the sequence-dependent binding energy of the proteins that bind these sites, and, importantly, measure the in vivo interaction energy between RNA polymerase and a DNA-bound transcription factor is demonstrated.
Abstract: Cells use protein-DNA and protein-protein interactions to regulate transcription. A biophysical understanding of this process has, however, been limited by the lack of methods for quantitatively characterizing the interactions that occur at specific promoters and enhancers in living cells. Here we show how such biophysical information can be revealed by a simple experiment in which a library of partially mutated regulatory sequences are partitioned according to their in vivo transcriptional activities and then sequenced en masse. Computational analysis of the sequence data produced by this experiment can provide precise quantitative information about how the regulatory proteins at a specific arrangement of binding sites work together to regulate transcription. This ability to reliably extract precise information about regulatory biophysics in the face of experimental noise is made possible by a recently identified relationship between likelihood and mutual information. Applying our experimental and computational techniques to the Escherichia coli lac promoter, we demonstrate the ability to identify regulatory protein binding sites de novo, determine the sequence-dependent binding energy of the proteins that bind these sites, and, importantly, measure the in vivo interaction energy between RNA polymerase and a DNA-bound transcription factor. Our approach provides a generally applicable method for characterizing the biophysical basis of transcriptional regulation by a specified regulatory sequence. The principles of our method can also be applied to a wide range of other problems in molecular biology.

Journal ArticleDOI
TL;DR: This study applied RNA-seq to globally sample transcripts of the cultivated rice Oryza sativa indica and japonica subspecies for resolving the whole-genome transcription profiles and found that approximately 48% of rice genes show alternative splicing patterns, considerably higher than previous estimations.
Abstract: The functional complexity of the rice transcriptome is not yet fully elucidated, despite many studies having reported the use of DNA microarrays. Next-generation DNA sequencing technologies provide a powerful approach for mapping and quantifying the transcriptome, termed RNA sequencing (RNA-seq). In this study, we applied RNA-seq to globally sample transcripts of the cultivated rice Oryza sativa indica and japonica subspecies for resolving the whole-genome transcription profiles. We identified 15,708 novel transcriptional active regions (nTARs), of which 51.7% have no homolog to public protein data and >63% are putative single-exon transcripts, which are highly different from protein-coding genes (<20%). We found that approximately 48% of rice genes show alternative splicing patterns, a percentage considerably higher than previous estimations. On the basis of the available rice gene models, 83.1% (46,472 genes) of the current rice gene models were validated by RNA-seq, and 6228 genes were identified to be extended at the 5' and/or 3' ends by at least 50 bp. Comparative transcriptome analysis demonstrated that 3464 genes exhibited differential expression patterns. The ratio of SNPs with nonsynonymous/synonymous mutations was nearly 1:1.06. In total, we interrogated and compared transcriptomes of the two rice subspecies to reveal the overall transcriptional landscape at maximal resolution.

Journal ArticleDOI
TL;DR: This study demonstrates the usefulness of the RNA-seq technology to study the transcriptional landscape of an organism whose genome has not been fully annotated.
Abstract: Transcription of protein-coding genes in trypanosomes is polycistronic and gene expression is primarily regulated by post-transcriptional mechanisms. Sequence motifs in the untranslated regions regulate mRNA trans-splicing and RNA stability, yet where UTRs begin and end is known for very few genes. We used high-throughput RNA sequencing to determine the genome-wide steady-state mRNA levels (‘transcriptomes’) for ~90% of the genome in two stages of the Trypanosoma brucei life cycle cultured in vitro. Almost 6% of genes were differentially expressed between the two life-cycle stages. We identified 5' splice-acceptor sites (SAS) and polyadenylation sites (PAS) for 6959 and 5948 genes, respectively. Most genes have between one and three alternative SAS, but PAS are more dispersed. For 488 genes, SAS were identified downstream of the originally assigned initiator ATG, so a subsequent in-frame ATG presumably designates the start of the true coding sequence. In some cases, alternative SAS would give rise to mRNAs encoding proteins with different N-terminal sequences. We could identify the introns in two genes known to contain them, but found no additional genes with introns. Our study demonstrates the usefulness of the RNA-seq technology to study the transcriptional landscape of an organism whose genome has not been fully annotated.

Journal ArticleDOI
TL;DR: Strain BO1(T) harboured four to five copies of the Brucella-specific insertion element IS711, displaying a unique banding pattern, and exhibited a unique 16S rRNA gene sequence and also grouped separately in multilocus sequence typing analysis.
Abstract: A Gram-negative, non-motile, non-spore-forming coccoid bacterium (strain BO1(T)) was isolated recently from a breast implant infection of a 71-year-old female patient with clinical signs of brucellosis. Affiliation of strain BO1(T) to the genus Brucella was confirmed by means of polyamine pattern, polar lipid profile, fatty acid profile, quinone system, DNA-DNA hybridization studies and by insertion sequence 711 (IS711)-specific PCR. Strain BO1(T) harboured four to five copies of the Brucella-specific insertion element IS711, displaying a unique banding pattern, and exhibited a unique 16S rRNA gene sequence and also grouped separately in multilocus sequence typing analysis. Strain BO1(T) reacted with Brucella M-monospecific antiserum. Incomplete lysis was detected with bacteriophages Tb (Tbilisi), F1 and F25. Biochemical profiling revealed a high degree of enzymic activity and metabolic capabilities. In multilocus VNTR (variable-number tandem-repeat) analysis, strain BO1(T) showed a very distinctive profile and clustered with the other 'exotic' Brucella strains, including strains isolated from marine mammals, and Brucella microti, Brucella suis biovar 5 and Brucella neotomae. Comparative omp2a and omp2b gene sequence analysis revealed the most divergent omp2 sequences identified to date for a Brucella strain. The recA gene sequence of strain BO1(T) differed in seven nucleotides from the Brucella recA consensus sequence. Using the Brucella species-specific multiplex PCR assay, strain BO1(T) displayed a unique banding pattern not observed in other Brucella species. From the phenotypic and molecular analysis it became evident that strain BO1(T) was clearly different from all other Brucella species, and therefore represents a novel species within the genus Brucella. Because of its unexpected isolation, the name Brucella inopinata with the type strain BO1(T) (=BCCN 09-01(T)=CPAM 6436(T)) is proposed.

Journal ArticleDOI
TL;DR: An algorithm to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences are developed, which demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications.
Abstract: Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding their genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that approximately 16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.

Journal ArticleDOI
TL;DR: Next generation sequencing was combined with high-throughput SNP detection assays to quickly discover large numbers of SNPs and those SNPs were then used to create a high resolution genetic map that assisted in the assembly of scaffolds from the 8× whole genome shotgun sequences into pseudomolecules corresponding to chromosomes of the organism.
Abstract: The Soybean Consensus Map 4.0 facilitated the anchoring of 95.6% of the soybean whole genome sequence developed by the Joint Genome Institute, Department of Energy, but its marker density was only sufficient to properly orient 66% of the sequence scaffolds. The discovery and genetic mapping of more single nucleotide polymorphism (SNP) markers were needed to anchor and orient the remaining genome sequence. To that end, next generation sequencing and high-throughput genotyping were combined to obtain a much higher resolution genetic map that could be used to anchor and orient most of the remaining sequence and to help validate the integrity of the existing scaffold builds. A total of 7,108 to 25,047 predicted SNPs were discovered using a reduced representation library that was subsequently sequenced by the Illumina sequence-by-synthesis method on the clonal single molecule array platform. Using multiple SNP prediction methods, the validation rate of these SNPs ranged from 79% to 92.5%. A high resolution genetic map using 444 recombinant inbred lines was created with 1,790 SNP markers. Of the 1,790 mapped SNP markers, 1,240 markers had been selectively chosen to target existing unanchored or un-oriented sequence scaffolds, thereby increasing the amount of anchored sequence to 97%. We have demonstrated how next generation sequencing was combined with high-throughput SNP detection assays to quickly discover large numbers of SNPs. Those SNPs were then used to create a high resolution genetic map that assisted in the assembly of scaffolds from the 8× whole genome shotgun sequences into pseudomolecules corresponding to chromosomes of the organism.

Journal ArticleDOI
TL;DR: The molecular analysis of large contiguous sequences produced from the bread wheat genome provides novel insights into the number, distribution, and density of genes along chromosome 3B and reveals an unexpectedly high amount of noncollinear genes compared to model grass genomes.
Abstract: To improve our understanding of the organization and evolution of the wheat (Triticum aestivum) genome, we sequenced and annotated 13-Mb contigs (18.2 Mb) originating from different regions of its largest chromosome, 3B (1 Gb), and produced a 2x chromosome survey by shotgun Illumina/Solexa sequencing. All regions carried genes irrespective of their chromosomal location. However, gene distribution was not random, with 75% of them clustered into small islands containing three genes on average. A twofold increase of gene density was observed toward the telomeres likely due to high tandem and interchromosomal duplication events. A total of 3222 transposable elements were identified, including 800 new families. Most of them are complete but showed a highly nested structure spread over distances as large as 200 kb. A succession of amplification waves involving different transposable element families led to contrasted sequence compositions between the proximal and distal regions. Finally, with an estimate of 50,000 genes per diploid genome, our data suggest that wheat may have a higher gene number than other cereals. Indeed, comparisons with rice (Oryza sativa) and Brachypodium revealed that a high number of additional noncollinear genes are interspersed within a highly conserved ancestral grass gene backbone, supporting the idea of an accelerated evolution in the Triticeae lineages.

Journal ArticleDOI
TL;DR: A new online pan-genome sequence analysis program, Panseq, which determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters and includes a loci selector that calculates the most variable and discriminatory loci among sets of accessory loci or core gene SNPs.
Abstract: The pan-genome of a bacterial species consists of a core and an accessory gene pool. The accessory genome is thought to be an important source of genetic variability in bacterial populations and is gained through lateral gene transfer, allowing subpopulations of bacteria to better adapt to specific niches. Low-cost and high-throughput sequencing platforms have created an exponential increase in genome sequence data and an opportunity to study the pan-genomes of many bacterial species. In this study, we describe a new online pan-genome sequence analysis program, Panseq. Panseq was used to identify Escherichia coli O157:H7 and E. coli K-12 genomic islands. Within a population of 60 E. coli O157:H7 strains, the existence of 65 accessory genomic regions identified by Panseq analysis was confirmed by PCR. The accessory genome and binary presence/absence data, and core genome and single nucleotide polymorphisms (SNPs) of six L. monocytogenes strains were extracted with Panseq and hierarchically clustered and visualized. The nucleotide core and binary accessory data were also used to construct maximum parsimony (MP) trees, which were compared to the MP tree generated by multi-locus sequence typing (MLST). The topology of the accessory and core trees was identical but differed from the tree produced using seven MLST loci. The Loci Selector module found the most variable and discriminatory combinations of four loci within a 100 loci set among 10 strains in 1 s, compared to the 449 s required to exhaustively search for all possible combinations; it also found the most discriminatory 20 loci from a 96 loci E. coli O157:H7 SNP dataset. Panseq determines the core and accessory regions among a collection of genomic sequences based on user-defined parameters. 
It readily extracts regions unique to a genome or group of genomes, identifies SNPs within shared core genomic regions, constructs files for use in phylogeny programs based on both the presence/absence of accessory regions and SNPs within core regions and produces a graphical overview of the output. Panseq also includes a loci selector that calculates the most variable and discriminatory loci among sets of accessory loci or core gene SNPs. Panseq is freely available online at http://76.70.11.198/panseq. Panseq is written in Perl.
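
The core/accessory split described in this abstract can be illustrated with a minimal sketch. It is not Panseq's implementation (Panseq is written in Perl and works on sequence data). It assumes, purely for illustration, that a presence/absence table has already been computed. The function name `split_core_accessory` and the parameter `core_min`, which stands in for a "user-defined parameter", are hypothetical.

```python
def split_core_accessory(presence, genomes, core_min=None):
    """Partition loci into core and accessory lists.

    presence : dict mapping locus name -> set of genome names carrying it
               (assumed to be precomputed; hypothetical input format)
    genomes  : list of all genome names in the comparison
    core_min : minimum number of genomes a locus must occur in to count as
               core; defaults to all genomes (hypothetical parameter)
    """
    genome_set = set(genomes)
    n_required = len(genome_set) if core_min is None else core_min
    core, accessory = [], []
    for locus in sorted(presence):
        n_present = len(presence[locus] & genome_set)
        if n_present >= n_required:
            core.append(locus)
        elif n_present > 0:
            # Present in some genomes, but too few to be called core.
            accessory.append(locus)
    return core, accessory


genomes = ["g1", "g2", "g3"]
presence = {
    "locA": {"g1", "g2", "g3"},
    "locB": {"g1"},
    "locC": {"g2", "g3"},
}
core, accessory = split_core_accessory(presence, genomes)
```

With the default setting only `locA` is core. Relaxing `core_min` to 2 also moves `locC` into the core list, which shows how the split depends on the chosen parameter.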

Journal ArticleDOI
TL;DR: The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date, and demonstrates that routine generation of broad cancer genome sequence is possible outside of genome centers.
Abstract: U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in PTEN was robustly identified, and genes involved in cell adhesion were overrepresented in the mutated gene list. Data were compared to 219,187 heterozygous single nucleotide polymorphisms assayed by Illumina 1M Duo genotyping array to assess accuracy: 93.83% of all SNPs were reliably detected at filtering thresholds that yield greater than 99.99% sequence accuracy. Protein coding sequences were disrupted predominantly in this cancer cell line due to small indels, large deletions, and translocations. In total, 512 genes were homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and 35 by interchromosomal translocations to reveal a highly mutated cell line genome. Of the small homozygously mutated variants, 8 SNVs and 99 indels were novel events not present in dbSNP. These data demonstrate that routine generation of broad cancer genome sequence is possible outside of genome centers. 
The sequence analysis of U87MG provides an unparalleled level of mutational resolution compared to any cell line to date.

Journal ArticleDOI
TL;DR: This genome-wide analysis represents the most extensive survey of EAR motif-containing proteins in Arabidopsis to date and provides a resource enabling investigations into their biological roles and the mechanism of EAR motif-mediated transcriptional regulation.
Abstract: The ethylene-responsive element binding factor-associated amphiphilic repression (EAR) motif is a transcriptional regulatory motif identified in members of the ethylene-responsive element binding factor, C2H2, and auxin/indole-3-acetic acid families of transcriptional regulators. Sequence comparison of the core EAR motif sites from these proteins revealed two distinct conservation patterns: LxLxL and DLNxxP. Proteins containing these motifs play key roles in diverse biological functions by negatively regulating genes involved in developmental, hormonal, and stress signaling pathways. Through a genome-wide bioinformatics analysis, we have identified the complete repertoire of the EAR repressome in Arabidopsis (Arabidopsis thaliana) comprising 219 proteins belonging to 21 different transcriptional regulator families. Approximately 72% of these proteins contain a LxLxL type of EAR motif, 22% contain a DLNxxP type of EAR motif, and the remaining 6% have a motif where LxLxL and DLNxxP are overlapping. Published in vitro and in planta investigations support approximately 40% of these proteins functioning as negative regulators of gene expression. Comparative sequence analysis of EAR motif sites and adjoining regions has identified additional preferred residues and potential posttranslational modification sites that may influence the functionality of the EAR motif. Homology searches against protein databases of poplar (Populus trichocarpa), grapevine (Vitis vinifera), rice (Oryza sativa), and sorghum (Sorghum bicolor) revealed that the EAR motif is conserved across these diverse plant species. This genome-wide analysis represents the most extensive survey of EAR motif-containing proteins in Arabidopsis to date and provides a resource enabling investigations into their biological roles and the mechanism of EAR motif-mediated transcriptional regulation.
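
The two conservation patterns named in this abstract, LxLxL and DLNxxP, can be searched for with regular expressions. The sketch below is illustrative only and is not the authors' pipeline. It assumes that "x" means any single residue, and the function name `find_ear_motifs` is hypothetical. Lookahead groups are used so that overlapping occurrences are all reported.

```python
import re

# "x" in LxLxL / DLNxxP is read as "any single residue" (an assumption).
# The lookahead (?=(...)) lets overlapping matches be found.
EAR_PATTERNS = {
    "LxLxL": re.compile(r"(?=(L.L.L))"),
    "DLNxxP": re.compile(r"(?=(DLN..P))"),
}


def find_ear_motifs(protein_seq):
    """Return a sorted list of (start, motif_type, matched_text) tuples.

    Positions are 0-based offsets into the protein sequence.
    """
    seq = protein_seq.upper()
    hits = []
    for name, pattern in EAR_PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)


# Toy sequences, not real Arabidopsis proteins.
print(find_ear_motifs("aaLELELaa"))
print(find_ear_motifs("AADLNAAPGG"))
```

A pattern match alone does not establish repressor function; the abstract notes that published experiments support a repressor role for only about 40% of the identified proteins.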

Journal ArticleDOI
TL;DR: Pseudomonas aeruginosa PAO1 shows an ongoing microevolution of genotype and phenotype that jeopardizes the reproducibility of research, and high-throughput genome resequencing will resolve more cases and could become a proper quality control for strain collections.
Abstract: Pseudomonas aeruginosa PAO1 is the most commonly used strain for research on this ubiquitous and metabolically versatile opportunistic pathogen. Strain PAO1, a derivative of the original Australian PAO isolate, has been distributed worldwide to laboratories and strain collections. Over decades discordant phenotypes of PAO1 sublines have emerged. Taking the existing PAO1-UW genome sequence (named after the University of Washington, which led the sequencing project) as a blueprint, the genome sequences of reference strains MPAO1 and PAO1-DSM (stored at the German Collection for Microorganisms and Cell Cultures [DSMZ]) were resolved by physical mapping and deep short read sequencing-by-synthesis. MPAO1 has been the source of near-saturation libraries of transposon insertion mutants, and PAO1-DSM is identical in its SpeI-DpnI restriction map with the original isolate. The major genomic differences of MPAO1 and PAO1-DSM in comparison to PAO1-UW are the lack of a large inversion, a duplication of a mobile 12-kb prophage region carrying a distinct integrase and protein phosphatases or kinases, deletions of 3 to 1,006 bp in size, and at least 39 single-nucleotide substitutions, 17 of which affect protein sequences. The PAO1 sublines differed in their ability to cope with nutrient limitation and their virulence in an acute murine airway infection model. Subline PAO1-DSM outnumbered the two other sublines in late stationary growth phase. In conclusion, P. aeruginosa PAO1 shows an ongoing microevolution of genotype and phenotype that jeopardizes the reproducibility of research. High-throughput genome resequencing will resolve more cases and could become a proper quality control for strain collections.

Journal ArticleDOI
TL;DR: A threshold of 13% divergence for VP1 nucleotide sequences is proposed for type assignment, a level that classifies the current dataset of 86 HRV-C VP1 sequences into a total of 33 types; a subsidiary classification assigns variants showing > 10% divergence in VP4/VP2, but lacking VP1 sequences, to 28 provisionally assigned types.
Abstract: Human rhinoviruses (HRVs) are common respiratory pathogens associated with mild upper respiratory tract infections, but also increasingly recognized in the aetiology of severe lower respiratory tract disease. Wider use of molecular diagnostics has led to a recent reappraisal of HRV genetic diversity, including the discovery of HRV species C (HRV-C), which is refractory to traditional virus isolation procedures. Although it is heterogeneous genetically, there has to date been no attempt to classify HRV-C into types analogous to the multiple serotypes identified for HRV-A and -B and among human enteroviruses. Direct investigation of cross-neutralization properties of HRV-C is precluded by the lack of methods for in vitro culture, but sequences from the capsid genes (VP1 and partial VP4/VP2) show evidence for marked phylogenetic clustering, suggesting the possibility of a genetically based system comparable to that used for the assignment of new enterovirus types. We propose a threshold of 13% divergence for VP1 nucleotide sequences for type assignment, a level that classifies the current dataset of 86 HRV-C VP1 sequences into a total of 33 types. We recognize, however, that most HRV-C sequence data have been collected in the VP4/VP2 region (currently 701 sequences between positions 615 and 1043). We propose a subsidiary classification of variants showing > 10% divergence in VP4/VP2, but lacking VP1 sequences, to 28 provisionally assigned types (subject to confirmation once VP1 sequences are determined). These proposals will assist in future epidemiological and clinical studies of HRV-C conducted by different groups worldwide, and provide the foundation for future exploration of type-associated differences in clinical presentations and biological properties.
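
As a rough illustration of how a divergence threshold like the 13% VP1 cut-off could be applied, the sketch below compares two aligned nucleotide sequences. It assumes that divergence means the uncorrected proportion of differing sites (p-distance) and that the sequences are already aligned to equal length. The abstract does not state the distance measure or whether the threshold is inclusive. The function names are hypothetical.

```python
VP1_THRESHOLD = 0.13  # 13% divergence, the value proposed in the abstract


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') or an 'N' in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def same_type(seq_a, seq_b, threshold=VP1_THRESHOLD):
    """True if the pair falls below the divergence threshold.

    Treating exactly 13% as a different type is an assumption.
    """
    return p_distance(seq_a, seq_b) < threshold


# Toy 10-nt sequences, far shorter than a real VP1 region.
print(same_type("ACGTACGTAC", "ACGTACGTAA"))  # 1 of 10 sites differ
print(same_type("ACGTACGTAC", "TTGTACGTAC"))  # 2 of 10 sites differ
```

The same function could be called with `threshold=0.10` for the provisional VP4/VP2 classification mentioned in the abstract.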

Journal ArticleDOI
TL;DR: The utility of a custom-designed, exon-targeted oligonucleotide array to detect intragenic copy-number changes in patients with various clinical phenotypes is demonstrated.
Abstract: Array comparative genomic hybridization (aCGH) is a powerful tool for the molecular elucidation and diagnosis of disorders resulting from genomic copy-number variation (CNV). However, intragenic deletions or duplications, those including genomic intervals of a size smaller than a gene, have remained beyond the detection limit of most clinical aCGH analyses. Increasing array probe number improves genomic resolution, although higher cost may limit implementation, and enhanced detection of benign CNV can confound clinical interpretation. We designed an array with exonic coverage of selected disease and candidate genes and used it clinically to identify losses or gains throughout the genome involving at least one exon and as small as several hundred base pairs in size. In some patients, the detected copy-number change occurs within a gene known to be causative of the observed clinical phenotype, demonstrating the ability of this array to detect clinically relevant CNVs with subkilobase resolution. In summary, we demonstrate the utility of a custom-designed, exon-targeted oligonucleotide array to detect intragenic copy-number changes in patients with various clinical phenotypes.

Journal ArticleDOI
TL;DR: The assembly of the 2.25-Gb genome of the giant panda from Illumina sequence reads with an average length of just 52 nucleotides is discussed, and some practical aspects such as data filtering and submission of assembly data to public repositories are discussed.
Abstract: A new generation of sequencing technologies is revolutionizing molecular biology. Illumina's Solexa and Applied Biosystems' SOLiD generate gigabases of nucleotide sequence per week. However, a perceived limitation of these ultra-high-throughput technologies is their short read-lengths. De novo assembly of sequence reads generated by classical Sanger capillary sequencing is a mature field of research. Unfortunately, the existing sequence assembly programs were not effective for short sequence reads generated by Illumina and SOLiD platforms. Early studies suggested that, in principle, sequence reads as short as 20-30 nucleotides could be used to generate useful assemblies of both prokaryotic and eukaryotic genome sequences, albeit containing many gaps. The early feasibility studies and proofs of principle inspired several bioinformatics research groups to implement new algorithms as freely available software tools specifically aimed at assembling reads of 30-50 nucleotides in length. This has led to the generation of several draft genome sequences based exclusively on short sequence Illumina sequence reads, recently culminating in the assembly of the 2.25-Gb genome of the giant panda from Illumina sequence reads with an average length of just 52 nucleotides. As well as reviewing recent developments in the field, we discuss some practical aspects such as data filtering and submission of assembly data to public repositories.

Journal ArticleDOI
TL;DR: The biological relevance of lncRNAs would be highly questionable if they were limited to closely related phyla, but their preservation across diverse amniotes, their apparent conservation in exon structure, and similarities in their pattern of brain expression during embryonic and early postnatal stages together indicate that these are functional RNA molecules.
Abstract: Background: Long considered to be the building block of life, it is now apparent that protein is only one of many functional products generated by the eukaryotic genome. Indeed, more of the human genome is transcribed into noncoding sequence than into protein-coding sequence. Nevertheless, whilst we have developed a deep understanding of the relationships between evolutionary constraint and function for protein-coding sequence, little is known about these relationships for non-coding transcribed sequence. This dearth of information is partially attributable to a lack of established non-protein-coding RNA (ncRNA) orthologs among birds and mammals within sequence and expression databases.