Showing papers in "Genome Research in 2010"


Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations
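
The MapReduce pattern that the GATK abstract mentions can be illustrated with a toy coverage calculator. This is a minimal Python sketch, not GATK code or its API: the `reads` intervals and the names `map_read`, `reduce_depth`, and `coverage` are hypothetical, chosen only to show how a per-read map step and an associative reduce step separate the analysis from the data traversal.

```python
from collections import Counter
from functools import reduce

# Hypothetical toy reads: half-open (start, end) intervals on a single contig.
reads = [(0, 5), (3, 8), (4, 6)]


def map_read(read):
    """Map step: turn one read into its per-position depth contribution."""
    start, end = read
    return Counter(range(start, end))


def reduce_depth(acc, partial):
    """Reduce step: merge a partial depth table into the accumulator.

    Addition of counts is associative, so partial results could be
    combined in any grouping, which is what makes parallelization possible.
    """
    acc.update(partial)  # Counter.update adds counts rather than replacing them
    return acc


def coverage(reads):
    """Apply the map step to every read and fold the results together."""
    return reduce(reduce_depth, map(map_read, reads), Counter())


depth = coverage(reads)
```

Here `depth[4]` counts all three reads, while positions outside every read are simply absent from the table and read back as zero.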


Journal ArticleDOI
TL;DR: The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
Abstract: Next-generation massively parallel DNA sequencing technologies provide ultrahigh throughput at a substantially lower unit data cost; however, the data are very short read length sequences, making de novo assembly extremely challenging. Here, we describe a novel method for de novo assembly of large genomes from short read sequences. We successfully assembled both the Asian and African human genome sequences, achieving an N50 contig size of 7.4 and 5.9 kilobases (kb) and scaffold of 446.3 and 61.9 kb, respectively. The development of this de novo short read assembly method creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

2,760 citations
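
The N50 statistic quoted in the abstract above is the length L such that contigs (or scaffolds) of length at least L contain at least half of all assembled bases. A minimal Python sketch of that standard definition follows; the function name `n50` and the example lengths are illustrative assumptions, not values from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths in kb (total 54, so half is 27).
contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
result = n50(contigs)
```

For the example list, the three longest contigs (10, 9, and 8 kb) sum to 27 kb, exactly half the total, so the N50 is 8 kb.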


Journal ArticleDOI
TL;DR: Applying phyloP to mammalian multiple alignments from the ENCODE project sheds light on patterns of conservation/acceleration in known and predicted functional elements, approximate fractions of sites subject to constraint, and differences in clade-specific selection in the primate and glires clades.
Abstract: Methods for detecting nucleotide substitution rates that are faster or slower than expected under neutral drift are widely used to identify candidate functional elements in genomic sequences. However, most existing methods consider either reductions (conservation) or increases (acceleration) in rate but not both, or assume that selection acts uniformly across the branches of a phylogeny. Here we examine the more general problem of detecting departures from the neutral rate of substitution in either direction, possibly in a clade-specific manner. We consider four statistical, phylogenetic tests for addressing this problem: a likelihood ratio test, a score test, a test based on exact distributions of numbers of substitutions, and the genomic evolutionary rate profiling (GERP) test. All four tests have been implemented in a freely available program called phyloP. Based on extensive simulation experiments, these tests are remarkably similar in statistical power. With 36 mammalian species, they all appear to be capable of fairly good sensitivity with low false-positive rates in detecting strong selection at individual nucleotides, moderate selection in 3-bp elements, and weaker or clade-specific selection in longer elements. By applying phyloP to mammalian multiple alignments from the ENCODE project, we shed light on patterns of conservation/acceleration in known and predicted functional elements, approximate fractions of sites subject to constraint, and differences in clade-specific selection in the primate and glires clades. We also describe new "Conservation" tracks in the UCSC Genome Browser that display both phyloP and phastCons scores for genome-wide alignments of 44 vertebrate species.

1,895 citations


Journal ArticleDOI
TL;DR: A whole-genome comparative view of DNA methylation using bisulfite sequencing of three cultured cell types representing progressive stages of differentiation highlights the value of high-resolution methylation maps, in conjunction with other systems-level analyses, for investigation of previously undetectable developmental regulatory mechanisms.
Abstract: DNA methylation is a critical epigenetic regulator in mammalian development. Here, we present a whole-genome comparative view of DNA methylation using bisulfite sequencing of three cultured cell types representing progressive stages of differentiation: human embryonic stem cells (hESCs), a fibroblastic differentiated derivative of the hESCs, and neonatal fibroblasts. As a reference, we compared our maps with a methylome map of a fully differentiated adult cell type, mature peripheral blood mononuclear cells (monocytes). We observed many notable common and cell-type-specific features among all cell types. Promoter hypomethylation (both CG and CA) and higher levels of gene body methylation were positively correlated with transcription in all cell types. Exons were more highly methylated than introns, and sharp transitions of methylation occurred at exon-intron boundaries, suggesting a role for differential methylation in transcript splicing. Developmental stage was reflected in both the level of global methylation and extent of non-CpG methylation, with hESC highest, fibroblasts intermediate, and monocytes lowest. Differentiation-associated differential methylation profiles were observed for developmentally regulated genes, including the HOX clusters, other homeobox transcription factors, and pluripotence-associated genes such as POU5F1, TCF3, and KLF4. Our results highlight the value of high-resolution methylation maps, in conjunction with other systems-level analyses, for investigation of previously undetectable developmental regulatory mechanisms.

1,017 citations


Journal ArticleDOI
TL;DR: The results suggest that like in animals, NMD and RUST may be widespread in plants and may play important roles in regulating gene expression.
Abstract: Alternative splicing can enhance transcriptome plasticity and proteome diversity. In plants, alternative splicing can be manifested at different developmental stages, and is frequently associated with specific tissue types or environmental conditions such as abiotic stress. We mapped the Arabidopsis transcriptome at single-base resolution using the Illumina platform for ultrahigh-throughput RNA sequencing (RNA-seq). Deep transcriptome sequencing confirmed a majority of annotated introns and identified thousands of novel alternatively spliced mRNA isoforms. Our analysis suggests that at least approximately 42% of intron-containing genes in Arabidopsis are alternatively spliced; this is significantly higher than previous estimates based on cDNA/expressed sequence tag sequencing. Random validation confirmed that novel splice isoforms empirically predicted by RNA-seq can be detected in vivo. Novel introns detected by RNA-seq were substantially enriched in nonconsensus terminal dinucleotide splice signals. Alternative isoforms with premature termination codons (PTCs) comprised the majority of alternatively spliced transcripts. Using an example of an essential circadian clock gene, we show that intron retention can generate relatively abundant PTC(+) isoforms and that this specific event is highly conserved among diverse plant species. Alternatively spliced PTC(+) isoforms can be potentially targeted for degradation by the nonsense mediated mRNA decay (NMD) surveillance machinery or regulate the level of functional transcripts by the mechanism of regulated unproductive splicing and translation (RUST). We demonstrate that the relative ratios of the PTC(+) and reference isoforms for several key regulatory genes can be considerably shifted under abiotic stress treatments. Taken together, our results suggest that like in animals, NMD and RUST may be widespread in plants and may play important roles in regulating gene expression.

845 citations


Journal ArticleDOI
TL;DR: ChIP-seq determination of transcription factor binding, in combination with GWA data, provides a powerful approach to further understanding the molecular bases of complex diseases.
Abstract: Initially thought to play a restricted role in calcium homeostasis, the pleiotropic actions of vitamin D in biology and their clinical significance are only now becoming apparent. However, the mode of action of vitamin D, through its cognate nuclear vitamin D receptor (VDR), and its contribution to diverse disorders, remain poorly understood. We determined VDR binding throughout the human genome using chromatin immunoprecipitation followed by massively parallel DNA sequencing (ChIP-seq). After calcitriol stimulation, we identified 2776 genomic positions occupied by the VDR and 229 genes with significant changes in expression in response to vitamin D. VDR binding sites were significantly enriched near autoimmune and cancer associated genes identified from genome-wide association (GWA) studies. Notable genes with VDR binding included IRF8, associated with MS, and PTPN2 associated with Crohn's disease and T1D. Furthermore, a number of single nucleotide polymorphism associations from GWA were located directly within VDR binding intervals, for example, rs13385731 associated with SLE and rs947474 associated with T1D. We also observed significant enrichment of VDR intervals within regions of positive selection among individuals of Asian and European descent. ChIP-seq determination of transcription factor binding, in combination with GWA data, provides a powerful approach to further understanding the molecular bases of complex diseases.

802 citations


Journal ArticleDOI
TL;DR: The age-PCGT methylation signature, validated in data sets that include normal and cancer solid tissues and a population of bone marrow mesenchymal stem/stromal cells, is present in preneoplastic conditions and may drive gene expression changes associated with carcinogenesis.
Abstract: Polycomb group proteins (PCGs) are involved in repression of genes that are required for stem cell differentiation. Recently, it was shown that promoters of PCG target genes (PCGTs) are 12-fold more likely to be methylated in cancer than non-PCGTs. Age is the most important demographic risk factor for cancer, and we hypothesized that its carcinogenic potential may be referred by irreversibly stabilizing stem cell features. To test this, we analyzed the methylation status of over 27,000 CpGs mapping to promoters of approximately 14,000 genes in whole blood samples from 261 postmenopausal women. We demonstrate that stem cell PCGTs are far more likely to become methylated with age than non-targets (odds ratio = 5.3 [3.8-7.4], P < 10^-10), independently of sex, tissue type, disease state, and methylation platform. We identified a specific subset of 69 PCGT CpGs that undergo hypermethylation with age and validated this methylation signature in seven independent data sets encompassing over 900 samples, including normal and cancer solid tissues and a population of bone marrow mesenchymal stem/stromal cells (P < 10^-5). We find that the age-PCGT methylation signature is present in preneoplastic conditions and may drive gene expression changes associated with carcinogenesis. These findings shed substantial novel insights into the epigenetic effects of aging and support the view that age may predispose to malignant transformation by irreversibly stabilizing stem cell features.

801 citations


Journal ArticleDOI
TL;DR: The origin and evolution of new genes and their functions in eukaryotes is reviewed, demonstrating that novel genes of the various types significantly impacted the evolution of cellular, physiological, morphological, behavioral, and reproductive phenotypic traits.
Abstract: Ever since the pre-molecular era, the birth of new genes with novel functions has been considered to be a major contributor to adaptive evolutionary innovation. Here, I review the origin and evolution of new genes and their functions in eukaryotes, an area of research that has made rapid progress in the past decade thanks to the genomics revolution. Indeed, recent work has provided initial whole-genome views of the different types of new genes for a large number of different organisms. The array of mechanisms underlying the origin of new genes is compelling, extending way beyond the traditionally well-studied source of gene duplication. Thus, it was shown that novel genes also regularly arose from messenger RNAs of ancestral genes, protein-coding genes metamorphosed into new RNA genes, genomic parasites were co-opted as new genes, and that both protein and RNA genes were composed from scratch (i.e., from previously nonfunctional sequences). These mechanisms then also contributed to the formation of numerous novel chimeric gene structures. Detailed functional investigations uncovered different evolutionary pathways that led to the emergence of novel functions from these newly minted sequences and, with respect to animals, attributed a potentially important role to one specific tissue, the testis, in the process of gene birth. Remarkably, these studies also demonstrated that novel genes of the various types significantly impacted the evolution of cellular, physiological, morphological, behavioral, and reproductive phenotypic traits. Consequently, it is now firmly established that new genes have indeed been major contributors to the origin of adaptive evolutionary novelties.

691 citations


Journal ArticleDOI
TL;DR: This work reports the first genome-scale study of epigenomic dynamics during normal human aging, identifying aging-associated differentially methylated regions (aDMRs) in whole blood and demonstrating that the aDMR signature is a multitissue phenomenon.
Abstract: In biological terms, aging can be defined as cellular senescence, which results in a diminished ability to respond to stress, increased homeostatic imbalance and risk of diseases such as cancer, and eventually death. Research in a variety of organisms has revealed that many factors are involved in the aging process at the molecular level. These include telomere-shortening, accumulation of genetic mutations, oxidative stress, and molecular pathways altered by quantitative and qualitative changes in nutrition (for review, see Vijg and Campisi 2008). More recently, several small-scale profiling studies have found directional epigenetic perturbations associated with aging in mammals (Fraga et al. 2005; Bjornsson et al. 2008; Boks et al. 2009; Christensen et al. 2009). Epigenetic modifications, such as DNA methylation and post-translational modifications of histone proteins, are indispensable for many aspects of genome function, including gene expression. The perturbation of epigenetic landscapes during aging could potentially influence cellular functions, thereby impacting on the development of various aging-associated phenotypes and/or diseases, such as cancer. However, until now, we have lacked a genome-scale view of aging-associated epigenomic dynamics. Such information would potentially reveal key genomic regions/features or molecular pathways that are susceptible to aging-related epigenetic perturbations. Here, we report the first genome-scale study of epigenomic dynamics during normal human aging. Our data support a model in which aging-associated differentially methylated regions (aDMRs) that gain methylation with age (hyper-aDMRs) arise in precursor/stem cells preferentially at bivalent chromatin domain promoters. This same category of promoters is frequently hypermethylated in cancers and in vitro cell culture, pointing to a novel mechanistic link between aberrant hypermethylation in cancer, aging, and cell culture.

680 citations


Journal ArticleDOI
TL;DR: The findings not only identify potentially relevant DNA methylation markers for the clinical characterization of SLE patients but also support the notion that epigenetic changes may be critical in the clinical manifestations of autoimmune disease.
Abstract: Monozygotic (MZ) twins are partially concordant for most complex diseases, including autoimmune disorders. Whereas phenotypic concordance can be used to study heritability, discordance suggests the role of non-genetic factors. In autoimmune diseases, environmentally driven epigenetic changes are thought to contribute to their etiology. Here we report the first high-throughput and candidate sequence analyses of DNA methylation to investigate discordance for autoimmune disease in twins. We used a cohort of MZ twins discordant for three diseases whose clinical signs often overlap: systemic lupus erythematosus (SLE), rheumatoid arthritis, and dermatomyositis. Only MZ twins discordant for SLE featured widespread changes in the DNA methylation status of a significant number of genes. Gene ontology analysis revealed enrichment in categories associated with immune function. Individual analysis confirmed the existence of DNA methylation and expression changes in genes relevant to SLE pathogenesis. These changes occurred in parallel with a global decrease in the 5-methylcytosine content that was concomitantly accompanied with changes in DNA methylation and expression levels of ribosomal RNA genes, although no changes in repetitive sequences were found. Our findings not only identify potentially relevant DNA methylation markers for the clinical characterization of SLE patients but also support the notion that epigenetic changes may be critical in the clinical manifestations of autoimmune disease.

611 citations


Journal ArticleDOI
TL;DR: A likelihood method for detecting selective sweeps that involves jointly modeling the multilocus allele frequency differentiation between two populations is presented, which is much more robust to ascertainment bias in SNP discovery than methods based on the allele frequency spectrum.
Abstract: Selective sweeps can increase genetic differentiation among populations and cause allele frequency spectra to depart from the expectation under neutrality. We present a likelihood method for detecting selective sweeps that involves jointly modeling the multilocus allele frequency differentiation between two populations. We use Brownian motion to model genetic drift under neutrality, and a deterministic model to approximate the effect of a selective sweep on single nucleotide polymorphisms (SNPs) in the vicinity. We test the method with extensive simulated data, and demonstrate that in some scenarios the method provides higher power than previously reported approaches to detect selective sweeps, and can provide surprisingly good localization of the position of a selected allele. A strength of our technique is that it uses allele frequency differentiation between populations, which is much more robust to ascertainment bias in SNP discovery than methods based on the allele frequency spectrum. We apply this method to compare continentally diverse populations, as well as Northern and Southern Europeans. Our analysis identifies a list of loci as candidate targets of selection, including well-known selected loci and new regions that have not been highlighted by previous scans for selection.

Journal ArticleDOI
TL;DR: The results reveal evolutionarily conserved aspects of developmentally regulated replication programs in mammals, demonstrate the power of replication profiling to distinguish closely related cell types, and strongly support the hypothesis that replication timing domains are spatially compartmentalized structural and functional units of three-dimensional chromosomal architecture.
Abstract: To identify evolutionarily conserved features of replication timing and their relationship to epigenetic properties, we profiled replication timing genome-wide in four human embryonic stem cell (hESC) lines, hESC-derived neural precursor cells (NPCs), lymphoblastoid cells, and two human induced pluripotent stem cell lines (hiPSCs), and compared them with related mouse cell types. Results confirm the conservation of coordinately replicated megabase-sized "replication domains" punctuated by origin-suppressed regions. Differentiation-induced replication timing changes in both species occur in 400- to 800-kb units and are similarly coordinated with transcription changes. A surprising degree of cell-type-specific conservation in replication timing was observed across regions of conserved synteny, despite considerable species variation in the alignment of replication timing to isochore GC/LINE-1 content. Notably, hESC replication timing profiles were significantly more aligned to mouse epiblast-derived stem cells (mEpiSCs) than to mouse ESCs. Comparison with epigenetic marks revealed a signature of chromatin modifications at the boundaries of early replicating domains and a remarkably strong link between replication timing and spatial proximity of chromatin as measured by Hi-C analysis. Thus, early and late initiation of replication occurs in spatially separate nuclear compartments, but rarely within the intervening chromatin. Moreover, cell-type-specific conservation of the replication program implies conserved developmental changes in spatial organization of chromatin. Together, our results reveal evolutionarily conserved aspects of developmentally regulated replication programs in mammals, demonstrate the power of replication profiling to distinguish closely related cell types, and strongly support the hypothesis that replication timing domains are spatially compartmentalized structural and functional units of three-dimensional chromosomal architecture.

Journal ArticleDOI
TL;DR: The issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data are described, and sequencing strategies that will yield a high-quality assembly are recommended.
Abstract: Second-generation sequencing technology can now be used to sequence an entire human genome in a matter of days and at low cost. Sequence read lengths, initially very short, have rapidly increased since the technology first appeared, and we now are seeing a growing number of efforts to sequence large genomes de novo from these short reads. In this Perspective, we describe the issues associated with short-read assembly, the different types of data produced by second-gen sequencers, and the latest assembly algorithms designed for these data. We also review the genomes that have been assembled recently from short reads and make recommendations for sequencing strategies that will yield a high-quality assembly.

Journal ArticleDOI
TL;DR: In polygenomic tumors, it is shown that heterogeneity can be ascribed to a few clonal subpopulations, rather than a series of gradual intermediates, and pathways of cancer progression and the organization of tumor growth are inferred.
Abstract: Cancer progression in humans is difficult to infer because we do not routinely sample patients at multiple stages of their disease. However, heterogeneous breast tumors provide a unique opportunity to study human tumor progression because they still contain evidence of early and intermediate subpopulations in the form of the phylogenetic relationships. We have developed a method we call Sector-Ploidy-Profiling (SPP) to study the clonal composition of breast tumors. SPP involves macro-dissecting tumors, flow-sorting genomic subpopulations by DNA content, and profiling genomes using comparative genomic hybridization (CGH). Breast carcinomas display two classes of genomic structural variation: (1) monogenomic and (2) polygenomic. Monogenomic tumors appear to contain a single major clonal subpopulation with a highly stable chromosome structure. Polygenomic tumors contain multiple clonal tumor subpopulations, which may occupy the same sectors, or separate anatomic locations. In polygenomic tumors, we show that heterogeneity can be ascribed to a few clonal subpopulations, rather than a series of gradual intermediates. By comparing multiple subpopulations from different anatomic locations, we have inferred pathways of cancer progression and the organization of tumor growth.

Journal ArticleDOI
TL;DR: The findings demonstrate a surprisingly high rate of hyper- and hypomethylation as a function of age in normal mouse small intestine tissues and a strong tissue-specificity to the process, suggesting that epigenetic deregulation is a common feature of aging in mammals.
Abstract: Aberrant methylation of promoter CpG islands in cancer is associated with silencing of tumor-suppressor genes, and age-dependent hypermethylation in normal appearing mucosa may be a risk factor for human colon cancer. It is not known whether this age-related DNA methylation phenomenon is specific to human tissues. We performed comprehensive DNA methylation profiling of promoter regions in aging mouse intestine using methylated CpG island amplification in combination with microarray analysis. By comparing C57BL/6 mice at 3-mo-old versus 35-mo-old for 3627 detectable autosomal genes, we found 774 (21%) that showed increased methylation and 466 (13%) that showed decreased methylation. We used pyrosequencing to quantitatively validate the microarray data and confirmed linear age-related methylation changes for all 12 genomic regions examined. We then examined 11 changed genomic loci for age-related methylation in other tissues. Of these, three of 11 showed similar changes in lung, seven of 11 changed in liver, and six of 11 changed in spleen, though to a lower degree than the changes seen in colon. There was partial conservation between age-related hypermethylation in human and mouse intestines, and Polycomb targets in embryonic stem cells were enriched among the hypermethylated genes. Our findings demonstrate a surprisingly high rate of hyper- and hypomethylation as a function of age in normal mouse small intestine tissues and a strong tissue-specificity to the process. We conclude that epigenetic deregulation is a common feature of aging in mammals. [Supplemental material is available online at http://www.genome.org.]

Journal ArticleDOI
TL;DR: The first transcriptome atlas for eight organs of cultivated rice is presented, providing extensive evidence that transcriptional regulation in rice is vastly more complex than previously believed.
Abstract: Understanding the dynamics of eukaryotic transcriptome is essential for studying the complexity of transcriptional regulation and its impact on phenotype. However, comprehensive studies of transcriptomes at single base resolution are rare, even for modern organisms, and lacking for rice. Here, we present the first transcriptome atlas for eight organs of cultivated rice. Using high-throughput paired-end RNA-seq, we unambiguously detected transcripts expressing at an extremely low level, as well as a substantial number of novel transcripts, exons, and untranslated regions. An analysis of alternative splicing in the rice transcriptome revealed that alternative cis-splicing occurred in approximately 33% of all rice genes. This is far more than previously reported. In addition, we also identified 234 putative chimeric transcripts that seem to be produced by trans-splicing, indicating that transcript fusion events are more common than expected. In-depth analysis revealed a multitude of fusion transcripts that might be by-products of alternative splicing. Validation and chimeric transcript structural analysis provided evidence that some of these transcripts are likely to be functional in the cell. Taken together, our data provide extensive evidence that transcriptional regulation in rice is vastly more complex than previously believed.

Journal ArticleDOI
TL;DR: A high-throughput method for analyzing transcription factor binding specificity that is based on systematic evolution of ligands by exponential enrichment (SELEX) and massively parallel sequencing is described and reveals unexpected dimeric modes of binding for several factors that were thought to preferentially bind DNA as monomers.
Abstract: The genetic code, the binding specificity of all transfer-RNAs, defines how protein primary structure is determined by DNA sequence. DNA also dictates when and where proteins are expressed, and this information is encoded in a pattern of specific sequence motifs that are recognized by transcription factors. However, the DNA-binding specificity is only known for a small fraction of the approximately 1400 human transcription factors (TFs). We describe here a high-throughput method for analyzing transcription factor binding specificity that is based on systematic evolution of ligands by exponential enrichment (SELEX) and massively parallel sequencing. The method is optimized for analysis of large numbers of TFs in parallel through the use of affinity-tagged proteins, barcoded selection oligonucleotides, and multiplexed sequencing. Data are analyzed by a new bioinformatic platform that uses the hundreds of thousands of sequencing reads obtained to control the quality of the experiments and to generate binding motifs for the TFs. The described technology allows higher throughput and identification of much longer binding profiles than current microarray-based methods. In addition, as our method is based on proteins expressed in mammalian cells, it can also be used to characterize DNA-binding preferences of full-length proteins or proteins requiring post-translational modifications. We validate the method by determining binding specificities of 14 different classes of TFs and by confirming the specificities for NFATC1 and RFX3 using ChIP-seq. Our results reveal unexpected dimeric modes of binding for several factors that were thought to preferentially bind DNA as monomers.

Journal ArticleDOI
TL;DR: A global meta-analysis of previously sampled microbial lineages in the environment is presented, hypothesizing that groupings of lineages are often ancient, and that they may have significantly impacted on genome evolution.
Abstract: Microbes are the most abundant and diverse organisms on Earth. In contrast to macroscopic organisms, their environmental preferences and ecological interdependencies remain difficult to assess, requiring laborious molecular surveys at diverse sampling sites. Here, we present a global meta-analysis of previously sampled microbial lineages in the environment. We grouped publicly available 16S ribosomal RNA sequences into operational taxonomic units at various levels of resolution and systematically searched these for co-occurrence across environments. Naturally occurring microbes, indeed, exhibited numerous, significant interlineage associations. These ranged from relatively specific groupings encompassing only a few lineages, to larger assemblages of microbes with shared habitat preferences. Many of the coexisting lineages were phylogenetically closely related, but a significant number of distant associations were observed as well. The increased availability of completely sequenced genomes allowed us, for the first time, to search for genomic correlates of such ecological associations. Genomes from coexisting microbes tended to be more similar than expected by chance, both with respect to pathway content and genome size, and outliers from these trends are discussed. We hypothesize that groupings of lineages are often ancient, and that they may have significantly impacted on genome evolution.
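
The search for co-occurring lineages described above can be sketched as a significance test on presence/absence data. The abstract does not state which statistic was used, so the one-sided Fisher's exact test below is an assumption, and the function name `cooccurrence_pvalue` and the toy presence vectors are hypothetical.

```python
from math import comb


def cooccurrence_pvalue(present_a, present_b):
    """One-sided Fisher's exact test for co-occurrence of two lineages.

    Each argument is a sequence of 0/1 flags, one per sampled environment.
    The result is the probability of observing at least as many shared
    environments as seen, if the two lineages were placed independently
    (hypergeometric upper tail).
    """
    if len(present_a) != len(present_b):
        raise ValueError("presence vectors must cover the same samples")
    n = len(present_a)
    a = sum(present_a)  # environments containing lineage A
    b = sum(present_b)  # environments containing lineage B
    both = sum(1 for x, y in zip(present_a, present_b) if x and y)
    tail = sum(
        comb(a, k) * comb(n - a, b - k)  # comb returns 0 when b - k > n - a
        for k in range(both, min(a, b) + 1)
    )
    return tail / comb(n, b)


# Hypothetical data: two lineages found in exactly the same 3 of 6 samples.
otu_a = [1, 1, 1, 0, 0, 0]
otu_b = [1, 1, 1, 0, 0, 0]
p_value = cooccurrence_pvalue(otu_a, otu_b)
```

With six samples, perfect overlap of two lineages that each occur three times has probability 1/20 under independence. A real survey would test many lineage pairs and would need a multiple-testing correction, which this sketch omits.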

Journal ArticleDOI
TL;DR: This study illustrates the power of mRNA sequencing for investigating regulatory evolution, provides novel insight into the evolution of gene expression in Drosophila, and reveals general trends that are likely to extend to other species.
Abstract: The regulation of gene expression is critical for organismal function and is an important source of phenotypic diversity between species. Understanding the genetic and molecular mechanisms responsible for regulatory divergence is therefore expected to provide insight into evolutionary change. Using deep sequencing, we quantified total and allele-specific mRNA expression levels genome-wide in two closely related Drosophila species (D. melanogaster and D. sechellia) and their F1 hybrids. We show that 78% of expressed genes have divergent expression between species, and that cis- and trans-regulatory divergence affects 51% and 66% of expressed genes, respectively, with 35% of genes showing evidence of both. This is a relatively larger contribution of trans-regulatory divergence than was expected based on prior studies, and may result from the unique demographic history of D. sechellia. Genes with antagonistic cis- and trans-regulatory changes were more likely to be misexpressed in hybrids, consistent with the idea that such regulatory changes contribute to hybrid incompatibilities. In addition, cis-regulatory differences contributed more to divergent expression of genes that showed additive rather than nonadditive inheritance. A correlation between sequence similarity and the conservation of cis-regulatory activity was also observed that appears to be a general feature of regulatory evolution. Finally, we examined regulatory divergence that may have contributed to the evolution of a specific trait: divergent feeding behavior in D. sechellia. Overall, this study illustrates the power of mRNA sequencing for investigating regulatory evolution, provides novel insight into the evolution of gene expression in Drosophila, and reveals general trends that are likely to extend to other species. [Supplemental material is available online at http://www.genome.org.
The sequencing data from this study have been submitted to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under accession no. GSE20421.] Phenotypic differences between species can arise from genetic changes affecting the function of gene products as well as their expression. Although there has been extensive debate over the relative importance of these two different types of changes (Carroll 2005; Hoekstra and Coyne 2007; Wray 2007; Stern and Orgogozo 2008), they both clearly contribute to phenotypic evolution. Functional divergence of a gene product has historically been much easier to detect than expression divergence; however, advances in methods for measuring gene expression during the last decade have made differences in gene expression much easier to identify. For example, microarray-based studies of gene expression in Drosophila have found that 20% of genes show expression differences between individuals of the same species (Genissel et al. 2008) and 34%–48% of genes show expression divergence between

Journal ArticleDOI
TL;DR: A pipeline for analyzing deep sequencing maps of chromatin structure is introduced and evidence for highly sensitive nucleosomes located within "nucleosome-free regions" is found, suggesting that these regions are not always completely naked but instead are likely associated with easily digested nucleosomes.
Abstract: Genome-wide mapping of nucleosomes has revealed a great deal about the relationships between chromatin structure and control of gene expression, and has led to mechanistic hypotheses regarding the rules by which chromatin structure is established. High-throughput sequencing has recently become the technology of choice for chromatin mapping studies, yet analysis of these experiments is still in its infancy. Here, we introduce a pipeline for analyzing deep sequencing maps of chromatin structure and apply it to data from S. cerevisiae. We analyze a digestion series where nucleosomes are isolated from under- and overdigested chromatin. We find that certain classes of nucleosomes are unusually susceptible or resistant to overdigestion, with promoter nucleosomes easily digested and mid-coding region nucleosomes being quite stable. We find evidence for highly sensitive nucleosomes located within "nucleosome-free regions," suggesting that these regions are not always completely naked but instead are likely associated with easily digested nucleosomes. Finally, since RNA polymerase is the dominant energy-consuming machine that operates on the chromatin template, we analyze changes in chromatin structure when RNA polymerase is inactivated via a temperature-sensitive mutation. We find evidence that RNA polymerase plays a role in nucleosome eviction at promoters and is also responsible for retrograde shifts in nucleosomes during transcription. Loss of RNA polymerase results in a relaxation of chromatin structure to more closely match in vitro nucleosome positioning preferences. Together, these results provide analytical tools and experimental guidance for nucleosome mapping experiments, and help disentangle the interlinked processes of transcription and chromatin packaging.

Journal ArticleDOI
TL;DR: In this paper, the authors studied miRNA profiles in 4419 human samples (3312 neoplastic, 1107 nonmalignant), corresponding to 50 normal tissues and 51 cancer types.
Abstract: We studied miRNA profiles in 4419 human samples (3312 neoplastic, 1107 nonmalignant), corresponding to 50 normal tissues and 51 cancer types. The complexity of our database enabled us to perform a detailed analysis of microRNA (miRNA) activities. We inferred genetic networks from miRNA expression in normal tissues and cancer. We also built, for the first time, specialized miRNA networks for solid tumors and leukemias. Nonmalignant tissues and cancer networks displayed a change in hubs, the most connected miRNAs. hsa-miR-103/106 were downgraded in cancer, whereas hsa-miR-30 became most prominent. Cancer networks appeared as built from disjointed subnetworks, as opposed to normal tissues. A comparison of these nets allowed us to identify key miRNA cliques in cancer. We also investigated miRNA copy number alterations in 744 cancer samples, at a resolution of 150 kb. Members of miRNA families should be similarly deleted or amplified, since they repress the same cellular targets and are thus expected to have similar impacts on oncogenesis. We correctly identified hsa-miR-17/92 family as amplified and the hsa-miR-143/145 cluster as deleted. Other miRNAs, such as hsa-miR-30 and hsa-miR-204, were found to be physically altered at the DNA copy number level as well. By combining differential expression, genetic networks, and DNA copy number alterations, we confirmed, or discovered, miRNAs with comprehensive roles in cancer. Finally, we experimentally validated the miRNA network with acute lymphocytic leukemia originated in Mir155 transgenic mice. Most of miRNAs deregulated in these transgenic mice were located close to hsa-miR-155 in the cancer network.

Journal ArticleDOI
TL;DR: The data show that cohesin cobinds across the genome with transcription factors independently of CTCF, plays a functional role in estrogen-regulated transcription, and may help to mediate tissue-specific transcriptional responses via long-range chromosomal interactions.
Abstract: The cohesin protein complex holds sister chromatids in dividing cells together and is essential for chromosome segregation. Recently, cohesin has been implicated in mediating transcriptional insulation, via its interactions with CTCF. Here, we show in different cell types that cohesin functionally behaves as a tissue-specific transcriptional regulator, independent of CTCF binding. By performing matched genome-wide binding assays (ChIP-seq) in human breast cancer cells (MCF-7), we discovered thousands of genomic sites that share cohesin and estrogen receptor alpha (ER) yet lack CTCF binding. By use of human hepatocellular carcinoma cells (HepG2), we found that liver-specific transcription factors colocalize with cohesin independently of CTCF at liver-specific targets that are distinct from those found in breast cancer cells. Furthermore, estrogen-regulated genes are preferentially bound by both ER and cohesin, and functionally, the silencing of cohesin caused aberrant re-entry of breast cancer cells into cell cycle after hormone treatment. We combined chromosomal interaction data in MCF-7 cells with our cohesin binding data to show that cohesin is highly enriched at ER-bound regions that capture inter-chromosomal loop anchors. Together, our data show that cohesin cobinds across the genome with transcription factors independently of CTCF, plays a functional role in estrogen-regulated transcription, and may help to mediate tissue-specific transcriptional responses via long-range chromosomal interactions.

Journal ArticleDOI
TL;DR: A chromosome-wide survey of ASM was carried out across 16 human pluripotent and adult cell lines using Illumina bisulfite sequencing and a potential role for CpG-SNP is suggested in connecting genetic variation with the epigenome.
Abstract: In diploid mammalian genomes, parental alleles can exhibit different methylation patterns (allele-specific DNA methylation, ASM), which have been documented in a small number of cases except for the imprinted regions and X chromosomes in females. We carried out a chromosome-wide survey of ASM across 16 human pluripotent and adult cell lines using Illumina bisulfite sequencing. We applied the principle of linkage disequilibrium (LD) analysis to characterize the correlation of methylation between adjacent CpG sites on single DNA molecules, and also investigated the correlation between CpG methylation and single nucleotide polymorphisms (SNPs). We observed ASM on 23%–37% of heterozygous SNPs in any given cell line. ASM is often cell-type-specific. Furthermore, we found that a significant fraction (38%–88%) of ASM regions is dependent on the presence of heterozygous SNPs in CpG dinucleotides that disrupt their methylation potential. This study identified distinct types of ASM across many cell types and suggests a potential role for CpG-SNP in connecting genetic variation with the epigenome.
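
The LD-style analysis described above can be illustrated with the usual r² statistic, computed over reads that each cover two adjacent CpG sites. This is a minimal sketch under that assumption, not the authors' implementation; the function name and the read data are hypothetical.

```python
def methylation_r2(reads):
    """r^2 between methylation states at two CpG sites seen on the same reads.

    reads: list of (site1, site2) pairs, each state 1 (methylated) or
    0 (unmethylated). Returns None if either site never varies, because
    r^2 is undefined in that case.
    """
    n = len(reads)
    if n == 0:
        raise ValueError("need at least one read")
    p1 = sum(a for a, _ in reads) / n            # methylation frequency, site 1
    p2 = sum(b for _, b in reads) / n            # methylation frequency, site 2
    p12 = sum(1 for a, b in reads if a == 1 and b == 1) / n
    d = p12 - p1 * p2                            # LD coefficient D
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        return None
    return d * d / denom

# Hypothetical reads: the two sites are always methylated together.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
print(methylation_r2(linked))
```

Values close to 1 indicate that methylation at the two sites travels together on single molecules; values close to 0 indicate independence.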

Journal ArticleDOI
TL;DR: Phylogenetic analysis indicated that each of the known ecotypes represents a strongly supported clade with divergence times ranging from approximately 150,000 to 700,000 yr ago, and it is predicted that phylogeographic mitogenomics will become an important tool for improved statistical phylogeography and more precise estimates of divergence times.
Abstract: Killer whales (Orcinus orca) currently comprise a single, cosmopolitan species with a diverse diet. However, studies over the last 30 yr have revealed populations of sympatric "ecotypes" with discrete prey preferences, morphology, and behaviors. Although these ecotypes avoid social interactions and are not known to interbreed, genetic studies to date have found extremely low levels of diversity in the mitochondrial control region, and few clear phylogeographic patterns worldwide. This low level of diversity is likely due to low mitochondrial mutation rates that are common to cetaceans. Using killer whales as a case study, we have developed a method to readily sequence, assemble, and analyze complete mitochondrial genomes from large numbers of samples to more accurately assess phylogeography and estimate divergence times. This represents an important tool for wildlife management, not only for killer whales but for many marine taxa. We used high-throughput sequencing to survey whole mitochondrial genome variation of 139 samples from the North Pacific, North Atlantic, and southern oceans. Phylogenetic analysis indicated that each of the known ecotypes represents a strongly supported clade with divergence times ranging from approximately 150,000 to 700,000 yr ago. We recommend that three named ecotypes be elevated to full species, and that the remaining types be recognized as subspecies pending additional data. Establishing appropriate taxonomic designations will greatly aid in understanding the ecological impacts and conservation needs of these important marine predators. We predict that phylogeographic mitogenomics will become an important tool for improved statistical phylogeography and more precise estimates of divergence times. [Supplemental material is available online at http://www.genome.org. The sequence data from this study have been submitted to GenBank (http://www.ncbi.nlm.nih.gov/genbank) under accession nos. GU187153–GU187164, GU187166–GU187219, and HM060332–HM060334.]

Journal ArticleDOI
TL;DR: Two new methods for substantially improving transcriptome de novo assembly were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
Abstract: Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.

Journal ArticleDOI
TL;DR: It is shown that RNAi-knockdown of ATRX induces a telomere-dysfunction phenotype and significantly reduces CBX5 enrichment at the telomeres, which suggests a novel function of ATRX, working in conjunction with H3.3 and CBX5, as a key regulator of ES-cell telomere chromatin.
Abstract: ATRX (alpha thalassemia/mental retardation syndrome X-linked) belongs to the SWI2/SNF2 family of chromatin remodeling proteins. Besides the ATPase/helicase domain at its C terminus, it contains a PHD-like zinc finger at the N terminus. Mutations in the ATRX gene are associated with X-linked mental retardation (XLMR) often accompanied by alpha thalassemia (ATRX syndrome). Although ATRX has been postulated to be a transcriptional regulator, its precise roles remain undefined. We demonstrate ATRX localization at the telomeres in interphase mouse embryonic stem (ES) cells in synchrony with the incorporation of H3.3 during telomere replication at S phase. Moreover, we found that chromobox homolog 5 (CBX5) (also known as heterochromatin protein 1 alpha, or HP1 alpha) is also present at the telomeres in ES cells. We show by coimmunoprecipitation that this localization is dependent on the association of ATRX with histone H3.3, and that mutating the K4 residue of H3.3 significantly diminishes ATRX and H3.3 interaction. RNAi-knockdown of ATRX induces a telomere-dysfunction phenotype and significantly reduces CBX5 enrichment at the telomeres. These findings suggest a novel function of ATRX, working in conjunction with H3.3 and CBX5, as a key regulator of ES-cell telomere chromatin.

Journal ArticleDOI
TL;DR: This study reconstructed the primary transcriptome of Sulfolobus solfataricus P2, one of the most widely studied model archaeal organisms; the data reveal internal hotspots for transcript cleavage linked to RNA degradation and predict sequence motifs that promote RNA destabilization.
Abstract: Organisms of the third domain of life, the Archaea, share molecular characteristics both with Bacteria and Eukarya. These organisms attract scientific attention as research models for regulation and evolution of processes such as transcription, translation, and RNA processing. We have reconstructed the primary transcriptome of Sulfolobus solfataricus P2, one of the most widely studied model archaeal organisms. Analysis of 625 million bases of sequenced cDNAs yielded a single-base-pair resolution map of transcription start sites and operon structures for more than 1000 transcriptional units. The analysis led to the discovery of 310 expressed noncoding RNAs, with an extensive expression of overlapping cis-antisense transcripts to a level unprecedented in any bacteria or archaea but resembling that of eukaryotes. As opposed to bacterial transcripts, most Sulfolobus transcripts completely lack 5'-UTR sequences, suggesting that mRNA/ncRNA interactions differ between Bacteria and Archaea. The data also reveal internal hotspots for transcript cleavage linked to RNA degradation and predict sequence motifs that promote RNA destabilization. This study highlights transcriptome sequencing as a key tool for understanding the mechanisms and extent of RNA-based regulation in Bacteria and Archaea.

Journal ArticleDOI
TL;DR: 3' addition events are widespread and conserved across animals, PAPD4 is a primary miRNA adenylating enzyme, and a role for 3' adenine addition in modulating miRNA effectiveness is suggested, possibly through interfering with incorporation into the RNA-induced silencing complex (RISC).
Abstract: The genomes of most animal species encode for mature microRNA (miRNA) (Grimson et al. 2008), a distinct class of 20–24-nucleotide (nt) base pair single-stranded noncoding RNA (ncRNA) which post-transcriptionally regulates messenger RNA (mRNA) copy level and translation efficiency through complementary binding of small stretches of base pairs typically in the 3′ untranslated region (UTR). In plants, several studies have implicated 3′ modification of mature miRNA as important to the stability of the miRNA (Li et al. 2005; Ramachandran and Chen 2008; Lu et al. 2009). miRNA deep sequencing experiments based on second-generation sequencing technologies suggested that a fraction of animal miRNAs may be subject to 3′ modification through addition of small numbers of nucleotides (Landgraf et al. 2007). While 3′ modifications in the form of adenine and uridine addition were observed, their biological role and the extent to which these modifications occur on a genome-wide scale remain poorly understood. Recently, the PAPD4 (also referred to as GLD-2) ribonucleotidyltransferase enzyme was shown to add a single adenine residue to liver-specific expressed miR-122 in humans and mice (Katoh et al. 2009). PAPD4 is a so-called “noncanonical” transferase (Martin and Keller 2007) belonging to the TRF family of nucleotidyltransferases (Aravind and Koonin 1999). Individual members of this unusual family of enzymes have displayed remarkable substrate flexibility, having often been implicated in modification of substrates belonging to distinct classes of coding and/or noncoding RNA moieties in the cell (Martin and Keller 2007). In at least one instance, the same enzyme has demonstrated the capacity to catalyze the addition of both uridine and adenine to different substrates (Trippe et al. 1998; Mellman et al. 2008). 
To better define the global contours of 3′ miRNA addition, we prepared short RNA libraries of human THP-1 monocytic cells and compared these with other publicly available deep-sequenced short RNA libraries. We also constructed libraries that were subject to siRNA-mediated knockdown of different nucleotidyltransferases, including PAPD4. To examine the biological role for adenine addition to the 3′ end of miRNA, we examined their association with Argonaute proteins, a crucial interaction required for miRNAs to perform their regulatory role.

Journal ArticleDOI
TL;DR: It is estimated that mice have significantly fewer escape genes compared with humans, and escape genes are marked by the absence of trimethylation at lysine 27 of histone H3, a chromatin modification associated with genes subject to X inactivation, which is developmentally regulated for some mouse genes.
Abstract: X inactivation equalizes the dosage of gene expression between the sexes, but some genes escape silencing and are thus expressed from both alleles in females. To survey X inactivation and escape in mouse, we performed RNA sequencing in Mus musculus × Mus spretus cells with complete skewing of X inactivation, relying on expression of single nucleotide polymorphisms to discriminate allelic origin. Thirteen of 393 (3.3%) mouse genes had significant expression from the inactive X, including eight novel escape genes. We estimate that mice have significantly fewer escape genes compared with humans. Furthermore, escape genes did not cluster in mouse, unlike the large escape domains in human, suggesting that expression is controlled at the level of individual genes. Our findings are consistent with the striking differences in phenotypes between female mice and women with a single X chromosome—a near normal phenotype in mice versus Turner syndrome and multiple abnormalities in humans. We found that escape genes are marked by the absence of trimethylation at lysine 27 of histone H3, a chromatin modification associated with genes subject to X inactivation. Furthermore, this epigenetic mark is developmentally regulated for some mouse genes.
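
Because the approach above uses SNPs to assign reads to the active or the inactive X, a per-gene inactive-X read fraction is a natural summary. The sketch below is illustrative only: the 10% fraction cutoff, the 20-read minimum, the function names, and the gene labels are all hypothetical assumptions, not the criteria used in the study.

```python
def inactive_x_fraction(active_reads, inactive_reads):
    """Fraction of SNP-informative reads that come from the inactive X."""
    total = active_reads + inactive_reads
    if total == 0:
        raise ValueError("no informative reads")
    return inactive_reads / total

def candidate_escape_genes(allele_counts, min_fraction=0.1, min_reads=20):
    """Return genes whose inactive-X fraction passes the cutoffs.

    allele_counts: dict mapping gene -> (active_reads, inactive_reads).
    min_fraction and min_reads are hypothetical thresholds.
    """
    hits = []
    for gene, (active, inactive) in sorted(allele_counts.items()):
        if active + inactive < min_reads:
            continue  # too few reads to judge
        if inactive_x_fraction(active, inactive) >= min_fraction:
            hits.append(gene)
    return hits

# Hypothetical read counts, for illustration only.
counts = {"geneA": (90, 30), "geneB": (200, 1), "geneC": (3, 2)}
print(candidate_escape_genes(counts))
```

A fixed cutoff is the simplest choice; a statistical test against an expected background rate of misassigned reads would be a more defensible design.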

Journal ArticleDOI
TL;DR: This study applied RNA-seq to globally sample transcripts of the cultivated rice Oryza sativa indica and japonica subspecies for resolving the whole-genome transcription profiles and found that approximately 48% of rice genes show alternative splicing patterns, considerably higher than previous estimations.
Abstract: The functional complexity of the rice transcriptome is not yet fully elucidated, despite many studies having reported the use of DNA microarrays. Next-generation DNA sequencing technologies provide a powerful approach for mapping and quantifying the transcriptome, termed RNA sequencing (RNA-seq). In this study, we applied RNA-seq to globally sample transcripts of the cultivated rice Oryza sativa indica and japonica subspecies for resolving the whole-genome transcription profiles. We identified 15,708 novel transcriptional active regions (nTARs), of which 51.7% have no homolog to public protein data and >63% are putative single-exon transcripts, which are highly different from protein-coding genes (<20%). We found that approximately 48% of rice genes show alternative splicing patterns, a percentage considerably higher than previous estimations. On the basis of the available rice gene models, 83.1% (46,472 genes) of the current rice gene models were validated by RNA-seq, and 6228 genes were identified to be extended at the 5' and/or 3' ends by at least 50 bp. Comparative transcriptome analysis demonstrated that 3464 genes exhibited differential expression patterns. The ratio of SNPs with nonsynonymous/synonymous mutations was nearly 1:1.06. In total, we interrogated and compared transcriptomes of the two rice subspecies to reveal the overall transcriptional landscape at maximal resolution.