Showing papers in "Genome Biology in 2014"


Journal ArticleDOI
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations


Journal ArticleDOI
TL;DR: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments, and the voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline.
Abstract: New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well as or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

4,475 citations
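
The first step behind "the mean-variance relationship of the log-counts" can be illustrated with a short sketch. It is not the limma/voom implementation: it assumes the log2 counts-per-million transform with offsets of 0.5 on the count and 1 on the library size, and it stops at per-gene means and standard deviations (the trend fitting and precision weights are omitted). The function names `log_cpm` and `mean_sd_per_gene` are hypothetical.

```python
import math
from statistics import mean, stdev


def log_cpm(counts, lib_sizes):
    """Convert raw counts to log2 counts-per-million.

    counts[g][s] is the read count of gene g in sample s and lib_sizes[s]
    is the total read count of sample s. The 0.5 and 1.0 offsets are an
    assumption of this sketch; they keep zero counts finite.
    """
    return [
        [math.log2((c + 0.5) / (lib_sizes[s] + 1.0) * 1e6) for s, c in enumerate(row)]
        for row in counts
    ]


def mean_sd_per_gene(values):
    """Return (mean, sample standard deviation) for each gene row."""
    return [(mean(row), stdev(row)) for row in values]


# Toy data: two genes, three samples (hypothetical numbers).
toy_counts = [[10, 12, 9], [200, 340, 150]]
toy_lib_sizes = [1_000_000, 1_200_000, 900_000]
for gene_mean, gene_sd in mean_sd_per_gene(log_cpm(toy_counts, toy_lib_sizes)):
    print(round(gene_mean, 3), round(gene_sd, 3))
```

Plotting the per-gene standard deviation against the per-gene mean is one way to see the mean-variance trend that the abstract refers to.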


Journal ArticleDOI
TL;DR: Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences that achieves classification accuracy comparable to the fastest BLAST program.
Abstract: Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.

3,317 citations
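
The idea of classifying reads by "exact alignment of k-mers" can be sketched in a few lines. This is a deliberately simplified illustration, not Kraken's algorithm: it assumes a tiny in-memory index, ignores k-mers shared between taxa, and assigns each read by a plain majority vote. The names `build_index` and `classify`, the taxon labels and the sequences are all hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each k-mer to the set of taxa whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon with the most exact k-mer hits, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        taxa = index.get(kmer)
        # Simplification: only k-mers unique to a single taxon get a vote.
        if taxa and len(taxa) == 1:
            votes[next(iter(taxa))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical toy references and reads.
toy_references = {"taxon_a": "ACGTACGTGGA", "taxon_b": "TTTTCCCCGGA"}
toy_index = build_index(toy_references, k=4)
print(classify("CGTACG", toy_index, k=4))
print(classify("AAAAAA", toy_index, k=4))
```

Exact dictionary lookups avoid any alignment scoring, which is the source of the speed advantage this kind of approach has over alignment-based tools.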


Journal ArticleDOI
TL;DR: Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout (MAGeCK) demonstrates better performance compared with existing methods, identifies both positively and negatively selected genes simultaneously, and reports robust results across different experimental conditions.
Abstract: We propose the Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout (MAGeCK) method for prioritizing single-guide RNAs, genes and pathways in genome-scale CRISPR/Cas9 knockout screens. MAGeCK demonstrates better performance compared with existing methods, identifies both positively and negatively selected genes simultaneously, and reports robust results across different experimental conditions. Using public datasets, MAGeCK identified novel essential genes and pathways, including EGFR in vemurafenib-treated A375 cells harboring a BRAF mutation. MAGeCK also detected cell type-specific essential genes, including BCR and ABL1, in KBM7 cells bearing a BCR-ABL fusion, and IGF1R in HL-60 cells, which depends on the insulin signaling pathway for proliferation.

1,439 citations


Journal ArticleDOI
TL;DR: A computational pipeline to identify circRNAs and quantify their relative abundance from RNA-seq data is developed, providing a new framework for future investigation of this intriguing topological isoform while raising doubts regarding a biological function of most circRNAs.
Abstract: Background: The recent reports of two circular RNAs (circRNAs) with strong potential to act as microRNA (miRNA) sponges suggest that circRNAs might play important roles in regulating gene expression. However, the global properties of circRNAs are not well understood. Results: We developed a computational pipeline to identify circRNAs and quantify their relative abundance from RNA-seq data. Applying this pipeline to a large set of non-poly(A)-selected RNA-seq data from the ENCODE project, we annotated 7,112 human circRNAs that were estimated to comprise at least 10% of the transcripts accumulating from their loci. Most circRNAs are expressed in only a few cell types and at low abundance, but they are no more cell-type-specific than are mRNAs with similar overall expression levels. Although most circRNAs overlap protein-coding sequences, ribosome profiling provides no evidence for their translation. We also annotated 635 mouse circRNAs, and although 20% of them are orthologous to human circRNAs, the sequence conservation of these circRNA orthologs is no higher than that of their neighboring linear exons. The previously proposed miR-7 sponge, CDR1as, is one of only two circRNAs with more miRNA sites than expected by chance, with the next best miRNA-sponge candidate deriving from a gene encoding a primate-specific zinc-finger protein, ZNF91. Conclusions: Our results provide a new framework for future investigation of this intriguing topological isoform while raising doubts regarding a biological function of most circRNAs.

1,293 citations


Journal ArticleDOI
TL;DR: The Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains is presented, demonstrating that the approach exhibits unrivaled speed while maintaining the accuracy of existing methods.
Abstract: Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to perform sequence comparisons of multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform. Together they provide interactive core-genome alignments, variant calls, recombination detection, and phylogenetic trees. Using simulated and real data we demonstrate that our approach exhibits unrivaled speed while maintaining the accuracy of existing methods. The Harvest suite is open-source and freely available from: http://github.com/marbl/harvest.

1,186 citations


Journal ArticleDOI
TL;DR: It is concluded that SAD1 dynamically controls splicing efficiency and splice-site recognition in Arabidopsis, and it is proposed that this may contribute to SAD1-mediated stress tolerance through the metabolism of transcripts expressed from stress-responsive genes.
Abstract: Sm-like proteins are highly conserved proteins that form the core of the U6 ribonucleoprotein and function in several mRNA metabolism processes, including pre-mRNA splicing. Despite their wide occurrence in all eukaryotes, little is known about the roles of Sm-like proteins in the regulation of splicing. Here, through comprehensive transcriptome analyses, we demonstrate that depletion of the Arabidopsis supersensitive to abscisic acid and drought 1 gene (SAD1), which encodes Sm-like protein 5 (LSm5), promotes an inaccurate selection of splice sites that leads to a genome-wide increase in alternative splicing. In contrast, overexpression of SAD1 strengthens the precision of splice-site recognition and globally inhibits alternative splicing. Further, SAD1 modulates the splicing of stress-responsive genes, particularly under salt-stress conditions. Finally, we find that overexpression of SAD1 in Arabidopsis improves salt tolerance in transgenic plants, which correlates with an increase in splicing accuracy and efficiency for stress-responsive genes. We conclude that SAD1 dynamically controls splicing efficiency and splice-site recognition in Arabidopsis, and propose that this may contribute to SAD1-mediated stress tolerance through the metabolism of transcripts expressed from stress-responsive genes. Our study not only provides novel insights into the function of Sm-like proteins in splicing, but also uncovers new means to improve splicing efficiency and to enhance stress tolerance in a higher eukaryote.

1,160 citations


Journal ArticleDOI
TL;DR: This review summarizes how, although CDKs are traditionally separated into cell-cycle or transcriptional CDKs, these activities are frequently combined in many family members.
Abstract: Cyclin-dependent kinases (CDKs) are protein kinases characterized by needing a separate subunit - a cyclin - that provides domains essential for enzymatic activity. CDKs play important roles in the control of cell division and modulate transcription in response to several extra- and intracellular cues. The evolutionary expansion of the CDK family in mammals led to the division of CDKs into three cell-cycle-related subfamilies (Cdk1, Cdk4 and Cdk5) and five transcriptional subfamilies (Cdk7, Cdk8, Cdk9, Cdk11 and Cdk20). Unlike the prototypical Cdc28 kinase of budding yeast, most of these CDKs bind one or a few cyclins, consistent with functional specialization during evolution. This review summarizes how, although CDKs are traditionally separated into cell-cycle or transcriptional CDKs, these activities are frequently combined in many family members. Not surprisingly, deregulation of this family of proteins is a hallmark of several diseases, including cancer, and drug-targeted inhibition of specific members has generated very encouraging results in clinical trials.

1,155 citations


Journal ArticleDOI
TL;DR: LUMPY is shown to yield improved sensitivity, especially when SV signal is reduced owing to either low coverage data or low intra-sample variant allele frequency; the authors also report a set of 4,564 validated breakpoints from the NA12878 human genome.
Abstract: Comprehensive discovery of structural variation (SV) from whole genome sequencing data requires multiple detection signals including read-pair, split-read, read-depth and prior knowledge. Owing to technical challenges, extant SV discovery algorithms either use one signal in isolation, or at best use two sequentially. We present LUMPY, a novel SV discovery framework that naturally integrates multiple SV signals jointly across multiple samples. We show that LUMPY yields improved sensitivity, especially when SV signal is reduced owing to either low coverage data or low intra-sample variant allele frequency. We also report a set of 4,564 validated breakpoints from the NA12878 human genome. LUMPY is available at https://github.com/arq5x/lumpy-sv.

1,125 citations


Journal ArticleDOI
TL;DR: This work examines data from five previously published studies, and finds strong evidence of cell composition change across age in blood, and demonstrates that, in these studies, cellular composition explains much of the observed variability in DNA methylation.
Abstract: Epigenome-wide association studies of human disease and other quantitative traits are becoming increasingly common. A series of papers reporting age-related changes in DNA methylation profiles in peripheral blood have already been published. However, blood is a heterogeneous collection of different cell types, each with a very different DNA methylation profile. Using a statistical method that permits estimating the relative proportion of cell types from DNA methylation profiles, we examine data from five previously published studies, and find strong evidence of cell composition change across age in blood. We also demonstrate that, in these studies, cellular composition explains much of the observed variability in DNA methylation. Furthermore, we find high levels of confounding between age-related variability and cellular composition at the CpG level. Our findings underscore the importance of considering cell composition variability in epigenetic studies based on whole blood and other heterogeneous tissue sources. We also provide software for estimating and exploring this composition confounding for the Illumina 450k microarray.

920 citations


Journal ArticleDOI
TL;DR: Although human-associated microbial communities are generally stable, they can be quickly and profoundly altered by common human actions and experiences, and changes in host fiber intake positively correlated with next-day abundance changes among 15% of gut microbiota members.
Abstract: Disturbance to human microbiota may underlie several pathologies. Yet, we lack a comprehensive understanding of how lifestyle affects the dynamics of human-associated microbial communities. Here, we link over 10,000 longitudinal measurements of human wellness and action to the daily gut and salivary microbiota dynamics of two individuals over the course of one year. These time series show overall microbial communities to be stable for months. However, rare events in each subject’s life rapidly and broadly impacted microbiota dynamics. Travel from the developed to the developing world in one subject led to a nearly two-fold increase in the Bacteroidetes to Firmicutes ratio, which reversed upon return. Enteric infection in the other subject resulted in the permanent decline of most gut bacterial taxa, which were replaced by genetically similar species. Still, even during periods of overall community stability, the dynamics of select microbial taxa could be associated with specific host behaviors. Most prominently, changes in host fiber intake positively correlated with next-day abundance changes among 15% of gut microbiota members. Our findings suggest that although human-associated microbial communities are generally stable, they can be quickly and profoundly altered by common human actions and experiences.

Journal ArticleDOI
TL;DR: The authors' epigenetic aging signature provides a simple biomarker to estimate the state of aging in blood; it facilitates age predictions with a mean absolute deviation from chronological age of less than 5 years, a precision higher than that of age predictions based on telomere length.
Abstract: Background: Human aging is associated with DNA methylation changes at specific sites in the genome. These epigenetic modifications may be used to track donor age for forensic analysis or to estimate biological age.

Journal ArticleDOI
TL;DR: The algorithm, functional normalization, is adapted to the Illumina 450k methylation array and outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects.
Abstract: We propose an extension to quantile normalization that removes unwanted technical variation using control probes. We adapt our algorithm, functional normalization, to the Illumina 450k methylation array and address the open problem of normalizing methylation data with global epigenetic changes, such as human cancers. Using data sets from The Cancer Genome Atlas and a large case–control study, we show that our algorithm outperforms all existing normalization methods with respect to replication of results between experiments, and yields robust results even in the presence of batch effects. Functional normalization can be applied to any microarray platform, provided suitable control probes are available.
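
For readers unfamiliar with the baseline being extended, plain quantile normalization can be sketched as follows. This is only the classical procedure, not functional normalization: the control-probe step described in the abstract is not shown. The function name `quantile_normalize` and the toy values are hypothetical, and ties are broken by position, which is a simplification.

```python
def quantile_normalize(columns):
    """Force every sample (column) to share the same empirical distribution.

    columns is a list of equal-length lists, one list per sample. Each value
    is replaced by the mean, across samples, of the values holding the same
    rank. Ties are broken by original position (a simplification).
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of probes")

    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the r-th smallest value over samples.
    reference = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]

    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by rank
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical samples measured on three probes.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every sample contains the same set of values, so remaining differences between samples lie only in which probe holds which rank.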

Journal ArticleDOI
TL;DR: A method is demonstrated for the prediction of chemotherapeutic response in patients using only before-treatment baseline tumor gene expression data; it was validated in three independent clinical trial datasets and obtained predictions equally good as, or better than, gene signatures derived directly from clinical data.
Abstract: We demonstrate a method for the prediction of chemotherapeutic response in patients using only before-treatment baseline tumor gene expression data. First, we fitted models for whole-genome gene expression against drug sensitivity in a large panel of cell lines, using a method that allows every gene to influence the prediction. Following data homogenization and filtering, these models were applied to baseline expression levels from primary tumor biopsies, yielding an in vivo drug sensitivity prediction. We validated this approach in three independent clinical trial datasets, and obtained predictions equally good, or better than, gene signatures derived directly from clinical data.

Journal ArticleDOI
TL;DR: This work presents Corset, a method that hierarchically clusters contigs using shared reads and expression, then summarizes read counts to clusters, ready for statistical testing and demonstrates that Corset out-performs alternative methods.
Abstract: Next generation sequencing has made it possible to perform differential gene expression studies in non-model organisms. For these studies, the need for a reference genome is circumvented by performing de novo assembly on the RNA-seq data. However, transcriptome assembly produces a multitude of contigs, which must be clustered into genes prior to differential gene expression detection. Here we present Corset, a method that hierarchically clusters contigs using shared reads and expression, then summarizes read counts to clusters, ready for statistical testing. Using a range of metrics, we demonstrate that Corset out-performs alternative methods. Corset is available from https://code.google.com/p/corset-project/.

Journal ArticleDOI
TL;DR: Genome-wide screening and functional analysis enabled the identification of a set of lncRNAs that are involved in the sexual reproduction of rice, and one lncRNA is demonstrated to play a role in panicle development and fertility.
Abstract: Long noncoding RNAs (lncRNAs) play important roles in a wide range of biological processes in mammals and plants. However, the systematic examination of lncRNAs in plants lags behind that in mammals. Recently, lncRNAs have been identified in Arabidopsis and wheat; however, no systematic screening of potential lncRNAs has been reported for the rice genome. In this study, we perform whole transcriptome strand-specific RNA sequencing (ssRNA-seq) of samples from rice anthers, pistils, and seeds 5 days after pollination and from shoots 14 days after germination. Using these data, together with 40 available rice RNA-seq datasets, we systematically analyze rice lncRNAs and definitively identify lncRNAs that are involved in the reproductive process. The results show that rice lncRNAs have some different characteristics compared to those of Arabidopsis and mammals and are expressed in a highly tissue-specific or stage-specific manner. We further verify the functions of a set of lncRNAs that are preferentially expressed in reproductive stages and identify several lncRNAs as competing endogenous RNAs (ceRNAs), which sequester miR160 or miR164 in a type of target mimicry. More importantly, one lncRNA, XLOC_057324, is demonstrated to play a role in panicle development and fertility. We also develop a source of rice lncRNA-associated insertional mutants. Genome-wide screening and functional analysis enabled the identification of a set of lncRNAs that are involved in the sexual reproduction of rice. The results also provide a source of lncRNAs and associated insertional mutants in rice.

Journal ArticleDOI
Catherine A. Brownstein, Alan H. Beggs, Nils Homer, Barry Merriman and 207 more authors (53 institutions)
TL;DR: The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases and reveals a general convergence of practices on most elements of the analysis and interpretation process.
Abstract: Background There is tremendous potential for genome sequencing to improve clinical diagnosis and care once it becomes routinely accessible, but this will require formalizing research methods into clinical best practices in the areas of sequence data generation, analysis, interpretation and reporting. The CLARITY Challenge was designed to spur convergence in methods for diagnosing genetic disease starting from clinical case history and genome sequencing data. DNA samples were obtained from three families with heritable genetic disorders and genomic sequence data were donated by sequencing platform vendors. The challenge was to analyze and interpret these data with the goals of identifying disease-causing variants and reporting the findings in a clinically useful format. Participating contestant groups were solicited broadly, and an independent panel of judges evaluated their performance.

Journal ArticleDOI
TL;DR: In this paper, the authors used a whole genome shotgun approach relying on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding.
Abstract: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied.
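
The "N50 scaffold size" quoted above is a standard contiguity statistic: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the computation follows; the function name `n50` and the toy lengths are illustrative and are not taken from the loblolly pine assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly of 300 units: 80 + 70 = 150 reaches half of the total.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```

A higher N50 means that more of the assembly sits in long scaffolds, which is why the abstract cites it as evidence of improvement over earlier conifer assemblies.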

Journal ArticleDOI
TL;DR: The authors report the joint analysis of sequence variants, gene expression and DNA methylation in primary fibroblast samples derived from a set of 62 unrelated individuals, and find a wide variety of relationships among the three, with considerable involvement of chromatin features and some discernible involvement of sequence variation.
Abstract: DNA methylation plays an essential role in the regulation of gene expression. While its presence near the transcription start site of a gene has been associated with reduced expression, the variation in methylation levels across individuals, its environmental or genetic causes, and its association with gene expression remain poorly understood. We report the joint analysis of sequence variants, gene expression and DNA methylation in primary fibroblast samples derived from a set of 62 unrelated individuals. Approximately 2% of the most variable CpG sites are mappable in cis to sequence variation, usually within 5 kb. Via eQTL analysis with microarray data combined with mapping of allelic expression regions, we obtained a set of 2,770 regions mappable in cis to sequence variation. In 9.5% of these expressed regions, an associated SNP was also a methylation QTL. Methylation and gene expression are often correlated without direct discernible involvement of sequence variation, but not always in the expected direction of negative for promoter CpGs and positive for gene-body CpGs. Population-level correlation between methylation and expression is strongest in a subset of developmentally significant genes, including all four HOX clusters. The presence and sign of this correlation are best predicted using specific chromatin marks rather than position of the CpG site with respect to the gene. Our results indicate a wide variety of relationships between gene expression, DNA methylation and sequence variation in untransformed adult human fibroblasts, with considerable involvement of chromatin features and some discernible involvement of sequence variation.

Journal ArticleDOI
TL;DR: Intriguingly, the inheritance of lncRNA expression patterns in 105 recombinant inbred lines reveals apparent transgressive segregation, and maize lncRNAs are less affected by cis- than by trans-genetic factors.
Abstract: Background: Long non-coding RNAs (lncRNAs) are transcripts that are 200 bp or longer, do not encode proteins, and potentially play important roles in eukaryotic gene regulation. However, the number, characteristics and expression inheritance pattern of lncRNAs in maize are still largely unknown. Results: By exploiting available public EST databases, maize whole genome sequence annotation and RNA-seq datasets from 30 different experiments, we identified 20,163 putative lncRNAs. Of these lncRNAs, more than 90% are predicted to be the precursors of small RNAs, while 1,704 are considered to be high-confidence lncRNAs. High confidence lncRNAs have an average transcript length of 463 bp and genes encoding them contain fewer exons than annotated genes. By analyzing the expression pattern of these lncRNAs in 13 distinct tissues and 105 maize recombinant inbred lines, we show that more than 50% of the high confidence lncRNAs are expressed in a tissue-specific manner, a result that is supported by epigenetic marks. Intriguingly, the inheritance of lncRNA expression patterns in 105 recombinant inbred lines reveals apparent transgressive segregation, and maize lncRNAs are less affected by cis- than by trans-genetic factors. Conclusions: We integrate all available transcriptomic datasets to identify a comprehensive set of maize lncRNAs, provide a unique annotation resource of the maize genome and a genome-wide characterization of maize lncRNAs, and explore the genetic control of their expression using expression quantitative trait locus mapping. Background While the central dogma defines the primary role for RNA as a messenger molecule in the process of gene expression, there is ample evidence for additional functions of RNA molecules. 
These RNA molecules include small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs; mainly tRNAs and rRNAs), signal recognition particle (7SL/SRP) RNAs, microRNAs (miRNAs), small interfering RNAs (siRNAs), piwi RNAs (piRNAs) and trans-acting siRNAs (ta-siRNAs), natural cis-acting siRNAs and long noncoding RNAs (lncRNAs). lncRNAs have been arbitrarily defined as non-protein coding RNAs more than 200 bp in length, distinguishing them from short noncoding RNAs such as miRNAs and

Journal ArticleDOI
TL;DR: Differential expression of the triplicated syntelogs and cytosine methylation levels across the sub-genomes suggest residual marks of the genome dominance that led to the current genome architecture, and epigenetic mechanisms play a role in the functional diversification of duplicate genes.
Abstract: Background: Brassica oleracea is a valuable vegetable species that has contributed to human health and nutrition for hundreds of years and comprises multiple distinct cultivar groups with diverse morphological and phytochemical attributes. In addition to this phenotypic wealth, B. oleracea offers unique insights into polyploid evolution, as it results from multiple ancestral polyploidy events and a final Brassiceae-specific triplication event. Further, B. oleracea represents one of the diploid genomes that formed the economically important allopolyploid oilseed, Brassica napus. A deeper understanding of B. oleracea genome architecture provides a foundation for crop improvement strategies throughout the Brassica genus. Results: We generate an assembly representing 75% of the predicted B. oleracea genome using a hybrid Illumina/ Roche 454 approach. Two dense genetic maps are generated to anchor almost 92% of the assembled scaffolds to nine pseudo-chromosomes. Over 50,000 genes are annotated and 40% of the genome predicted to be repetitive, thus contributing to the increased genome size of B. oleracea compared to its close relative B. rapa. A snapshot of both the leaf transcriptome and methylome allows comparisons to be made across the triplicated sub-genomes, which resulted from the most recent Brassiceae-specific polyploidy event. Conclusions: Differential expression of the triplicated syntelogs and cytosine methylation levels across the sub-genomes suggest residual marks of the genome dominance that led to the current genome architecture. Although cytosine methylation does not correlate with individual gene dominance, the independent methylation patterns of triplicated copies suggest epigenetic mechanisms play a role in the functional diversification of duplicate genes.

Journal ArticleDOI
TL;DR: This study reveals a series of complex adaptations of the brown planthopper involving a variety of biological processes, that result in its highly destructive impact on the exclusive host rice.
Abstract: The brown planthopper, Nilaparvata lugens, the most destructive pest of rice, is a typical monophagous herbivore that feeds exclusively on rice sap, which migrates over long distances. Outbreaks of it have re-occurred approximately every three years in Asia. It has also been used as a model system for ecological studies and for developing effective pest management. To better understand how a monophagous sap-sucking arthropod herbivore has adapted to its exclusive host selection and to provide insights to improve pest control, we analyzed the genomes of the brown planthopper and its two endosymbionts. We describe the 1.14 gigabase planthopper draft genome and the genomes of two microbial endosymbionts that permit the planthopper to forage exclusively on rice fields. Only 40.8% of the 27,571 identified Nilaparvata protein coding genes have detectable shared homology with the proteomes of the other 14 arthropods included in this study, reflecting large-scale gene losses including in evolutionarily conserved gene families and biochemical pathways. These unique genomic features are functionally associated with the animal’s exclusive plant host selection. Genes missing from the insect in conserved biochemical pathways that are essential for its survival on the nutritionally imbalanced sap diet are present in the genomes of its microbial endosymbionts, which have evolved to complement the mutualistic nutritional needs of the host. Our study reveals a series of complex adaptations of the brown planthopper involving a variety of biological processes, that result in its highly destructive impact on the exclusive host rice. All these findings highlight potential directions for effective pest control of the planthopper.

Journal ArticleDOI
TL;DR: The findings suggest that temporal dynamics may need to be considered when attempting to link changes in microbiome structure to changes in health status, and show that not only is the composition of an individual's microbiome highly personalized, but their degree of temporal variability is also a personalized feature.
Abstract: Background: It is now apparent that the complex microbial communities found on and in the human body vary across individuals. What has largely been missing from previous studies is an understanding of how these communities vary over time within individuals. To the extent to which it has been considered, it is often assumed that temporal variability is negligible for healthy adults. Here we address this gap in understanding by profiling the forehead, gut (fecal), palm, and tongue microbial communities in 85 adults, weekly over 3 months. Results: We found that skin (forehead and palm) varied most in the number of taxa present, whereas gut and tongue communities varied more in the relative abundances of taxa. Within each body habitat, there was a wide range of temporal variability across the study population, with some individuals harboring more variable communities than others. The best predictor of these differences in variability across individuals was microbial diversity; individuals with more diverse gut or tongue communities were more stable in composition than individuals with less diverse communities. Conclusions: Longitudinal sampling of a relatively large number of individuals allowed us to observe high levels of temporal variability in both diversity and community structure in all body habitats studied. These findings suggest that temporal dynamics may need to be considered when attempting to link changes in microbiome structure to changes in health status. Furthermore, our findings show that, not only is the composition of an individual’s microbiome highly personalized, but their degree of temporal variability is also a personalized feature.

Journal ArticleDOI
TL;DR: This genome-wide methylation profiling study identified tissue-specific differentially methylated regions in 17 human somatic tissues, and a clear inverse correlation is observed between promoter methylation within CpG islands and gene expression data obtained from publicly available databases.
Abstract: DNA epigenetic modifications, such as methylation, are important regulators of tissue differentiation, contributing to processes of both development and cancer. Profiling the tissue-specific DNA methylome patterns will provide novel insights into normal and pathogenic mechanisms, as well as help in future epigenetic therapies. In this study, 17 somatic tissues from four autopsied humans were subjected to functional genome analysis using the Illumina Infinium HumanMethylation450 BeadChip, covering 486 428 CpG sites. Only 2% of the CpGs analyzed are hypermethylated in all 17 tissue specimens; these permanently methylated CpG sites are located predominantly in gene-body regions. In contrast, 15% of the CpGs are hypomethylated in all specimens and are primarily located in regions proximal to transcription start sites. A vast number of tissue-specific differentially methylated regions are identified and considered likely mediators of tissue-specific gene regulatory mechanisms since the hypomethylated regions are closely related to known functions of the corresponding tissue. Finally, a clear inverse correlation is observed between promoter methylation within CpG islands and gene expression data obtained from publicly available databases. This genome-wide methylation profiling study identified tissue-specific differentially methylated regions in 17 human somatic tissues. Many of the genes corresponding to these differentially methylated regions contribute to tissue-specific functions. Future studies may use these data as a reference to identify markers of perturbed differentiation and disease-related pathogenic mechanisms.

Journal ArticleDOI
TL;DR: It is found that integrating a number of computational methods to detect genes with differentially retained introns provides a strategy to enrich for alternatively spliced exons in mammalian RNA-seq data, when complemented by RNA-sequencing analysis of purified cells with experimentally perturbed RNA-binding proteins.
Abstract: Retention of a subset of introns in spliced polyadenylated mRNA is emerging as a frequent, unexplained finding from RNA deep sequencing in mammalian cells. Here we analyze intron retention in T lymphocytes by deep sequencing polyadenylated RNA. We show a developmentally regulated RNA-binding protein, hnRNPLL, induces retention of specific introns by sequencing RNA from T cells with an inactivating Hnrpll mutation and from B lymphocytes that physiologically downregulate Hnrpll during their differentiation. In Ptprc mRNA encoding the tyrosine phosphatase CD45, hnRNPLL induces selective retention of introns flanking exons 4 to 6; these correspond to the cassette exons containing hnRNPLL binding sites that are skipped in cells with normal, but not mutant or low, hnRNPLL. We identify similar patterns of hnRNPLL-induced differential intron retention flanking alternative exons in 14 other genes, representing novel elements of the hnRNPLL-induced splicing program in T cells. Retroviral expression of a normally spliced cDNA for one of these targets, Senp2, partially corrects the survival defect of Hnrpll-mutant T cells. We find that integrating a number of computational methods to detect genes with differentially retained introns provides a strategy to enrich for alternatively spliced exons in mammalian RNA-seq data, when complemented by RNA-seq analysis of purified cells with experimentally perturbed RNA-binding proteins. Our findings demonstrate that intron retention in mRNA is induced by specific RNA-binding proteins and suggest a biological significance for this process in marking exons that are poised for alternative splicing.

Journal ArticleDOI
TL;DR: The findings suggest that many lncRNAs may have a function in cytoplasmic processes, and in particular in ribosome complexes.
Abstract: Long noncoding RNAs (lncRNAs) form an abundant class of transcripts, but the function of the majority of them remains elusive. While it has been shown that some lncRNAs are bound by ribosomes, it has also been convincingly demonstrated that these transcripts do not code for proteins. To obtain a comprehensive understanding of the extent to which lncRNAs bind ribosomes, we performed systematic RNA sequencing on ribosome-associated RNA pools obtained through ribosomal fractionation and compared the RNA content with nuclear and (non-ribosome bound) cytosolic RNA pools. The RNA composition of the subcellular fractions differs significantly from each other, but lncRNAs are found in all locations. A subset of specific lncRNAs is enriched in the nucleus but surprisingly the majority is enriched in the cytosol and in ribosomal fractions. The ribosomal enriched lncRNAs include H19 and TUG1. Most studies on lncRNAs have focused on the regulatory function of these transcripts in the nucleus. We demonstrate that only a minority of all lncRNAs are nuclear enriched. Our findings suggest that many lncRNAs may have a function in cytoplasmic processes, and in particular in ribosome complexes.

Journal ArticleDOI
TL;DR: A computational framework to annotate and prioritize noncoding drivers from thousands of somatic alterations in a typical tumor, FunSeq2, which combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline.
Abstract: Identification of noncoding drivers from thousands of somatic alterations in a typical tumor is a difficult and unsolved problem. We report a computational framework, FunSeq2, to annotate and prioritize these mutations. The framework combines an adjustable data context integrating large-scale genomics and cancer resources with a streamlined variant-prioritization pipeline. The pipeline has a weighted scoring system combining: inter- and intra-species conservation; loss- and gain-of-function events for transcription-factor binding; enhancer-gene linkages and network centrality; and per-element recurrence across samples. We further highlight putative drivers with information specific to a particular sample, such as differential expression. FunSeq2 is available from funseq2.gersteinlab.org.

Journal ArticleDOI
TL;DR: Overall genus-level microbiota composition exhibits a shift in controls from low to high levels of Prevotella and in MSD cases from high to low levels of Escherichia/Shigella in younger versus older children; however, there was significant variation among many genera by both site and age.
Abstract: Background: Diarrheal diseases continue to contribute significantly to morbidity and mortality in infants and young children in developing countries. There is an urgent need to better understand the contributions of novel, potentially uncultured, diarrheal pathogens to severe diarrheal disease, as well as distortions in normal gut microbiota composition that might facilitate severe disease. Results: We use high throughput 16S rRNA gene sequencing to compare fecal microbiota composition in children under five years of age who have been diagnosed with moderate to severe diarrhea (MSD) with the microbiota from diarrhea-free controls. Our study includes 992 children from four low-income countries in West and East Africa, and Southeast Asia. Known pathogens, as well as bacteria currently not considered as important diarrhea-causing pathogens, are positively associated with MSD, and these include Escherichia/Shigella and Granulicatella species and Streptococcus mitis/pneumoniae groups. In both cases and controls, there tend to be distinct negative correlations between facultative anaerobic lineages and obligate anaerobic lineages. Overall genus-level microbiota composition exhibits a shift in controls from low to high levels of Prevotella and in MSD cases from high to low levels of Escherichia/Shigella in younger versus older children; however, there was significant variation among many genera by both site and age. Conclusions: Our findings expand the current understanding of microbiota-associated diarrhea pathogenicity in young children from developing countries. Our findings are necessarily based on correlative analyses and must be further validated through epidemiological and molecular techniques.

Journal ArticleDOI
TL;DR: Colorectal cancer primary tumors and metastases exhibit high genomic concordance, and consistency between targeted sequencing and whole genome sequencing results suggests that targeted sequencing may be a suitable strategy for clinical diagnostic applications.
Abstract: Colorectal cancer is the second leading cause of cancer death in the United States, with over 50,000 deaths estimated in 2014. Molecular profiling for somatic mutations that predict absence of response to anti-EGFR therapy has become standard practice in the treatment of metastatic colorectal cancer; however, the quantity and type of tissue available for testing is frequently limited. Further, the degree to which the primary tumor is a faithful representation of metastatic disease has been questioned. As next-generation sequencing technology becomes more widely available for clinical use and additional molecularly targeted agents are considered as treatment options in colorectal cancer, it is important to characterize the extent of tumor heterogeneity between primary and metastatic tumors. We performed deep coverage, targeted next-generation sequencing of 230 key cancer-associated genes for 69 matched primary and metastatic tumors and normal tissue. Mutation profiles were 100% concordant for KRAS, NRAS, and BRAF, and were highly concordant for recurrent alterations in colorectal cancer. Additionally, whole genome sequencing of four patient trios did not reveal any additional site-specific targetable alterations. Colorectal cancer primary tumors and metastases exhibit high genomic concordance. As current clinical practices in colorectal cancer revolve around KRAS, NRAS, and BRAF mutation status, diagnostic sequencing of either primary or metastatic tissue as available is acceptable for most patients. Additionally, consistency between targeted sequencing and whole genome sequencing results suggests that targeted sequencing may be a suitable strategy for clinical diagnostic applications.

Journal ArticleDOI
TL;DR: The study of single cancer cells has transformed from qualitative microscopic images to quantitative genomic datasets, fueled by the development of single-cell sequencing technologies, which provide a powerful new approach to study complex biological processes in human cancers.
Abstract: The study of single cancer cells has transformed from qualitative microscopic images to quantitative genomic datasets. This paradigm shift has been fueled by the development of single-cell sequencing technologies, which provide a powerful new approach to study complex biological processes in human cancers.