Showing papers in "Genome Biology in 2011"


Journal ArticleDOI
TL;DR: A new method for metagenomic biomarker discovery is described and validated by way of class comparison, tests of biological consistency and effect size estimation, addressing the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities.
Abstract: This study describes and validates a new method for metagenomic biomarker discovery by way of class comparison, tests of biological consistency and effect size estimation. This addresses the challenge of finding organisms, genes, or pathways that consistently explain the differences between two or more microbial communities, which is a central problem to the study of metagenomics. We extensively validate our method on several microbiomes and a convenient online interface for the method is provided at http://huttenhower.sph.harvard.edu/lefse/.

9,057 citations


Journal ArticleDOI
TL;DR: By separating SCNA profiles into underlying arm-level and focal alterations, the estimation of background rates for each category is improved, and a probabilistic method for defining the boundaries of selected-for SCNA regions with user-defined confidence is described.
Abstract: We describe methods with enhanced power and specificity to identify genes targeted by somatic copy-number alterations (SCNAs) that drive cancer growth. By separating SCNA profiles into underlying arm-level and focal alterations, we improve the estimation of background rates for each category. We additionally describe a probabilistic method for defining the boundaries of selected-for SCNA regions with user-defined confidence. Here we detail this revised computational approach, GISTIC2.0, and validate its performance in real and simulated datasets.

2,392 citations


Journal ArticleDOI
TL;DR: Improvements in expression estimates as measured by correlation with independently performed qRT-PCR are found and correction of bias leads to improved replicability of results across libraries and sequencing technologies.
Abstract: The biochemistry of RNA-Seq library preparation results in cDNA fragments that are not uniformly distributed within the transcripts they represent. This non-uniformity must be accounted for when estimating expression levels, and we show how to perform the needed corrections using a likelihood based approach. We find improvements in expression estimates as measured by correlation with independently performed qRT-PCR and show that correction of bias leads to improved replicability of results across libraries and sequencing technologies.

1,220 citations


Journal ArticleDOI
TL;DR: PCR during library preparation is identified as a principal source of base-composition bias, and an improved protocol with optimized conditions significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
Abstract: Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.

1,099 citations


Journal ArticleDOI
TL;DR: The largest human microbiota time series analysis to date is presented, covering two individuals at four body sites over 396 timepoints; despite stable differences between body sites and individuals, it finds pronounced variability in an individual's microbiota across months, weeks and even days.
Abstract: Background: Understanding the normal temporal variation in the human microbiome is critical to developing treatments for putative microbiome-related afflictions such as obesity, Crohn's disease, inflammatory bowel disease and malnutrition. Sequencing and computational technologies, however, have been a limiting factor in performing dense time series analysis of the human microbiome. Here, we present the largest human microbiota time series analysis to date, covering two individuals at four body sites over 396 timepoints.

1,031 citations


Journal ArticleDOI
TL;DR: The Catalogue Of Somatic Mutations In Cancer (COSMIC), one of the largest repositories of information on somatic mutations in human cancer, curates and standardizes this information in a single database, providing user-friendly browsing tools and analytical functions, thus ensuring its role as a key resource in human cancer genetics.
Abstract: The Catalogue Of Somatic Mutations In Cancer (COSMIC) [1] is one of the largest repositories of information on somatic mutations in human cancer. The project has been running for more than ten years as part of the Cancer Genome Project (CGP) at the Wellcome Trust Sanger Institute in the UK. The data in COSMIC are curated from a variety of sources, primarily the scientific literature and large international consortia. The project includes information from the CGP, along with data from other consortia such as the International Cancer Genome Consortium and The Cancer Genome Atlas. In addition, COSMIC is regularly updated with the genes highlighted in the Cancer Gene Census, which curates the scientific literature for known cancer genes [2]. With the advent of whole exome and genome sequencing technology, the amount of data in COSMIC is increasing rapidly. The recent COSMIC release (version 53; 18 May 2011) contains 608,042 tumor and cell line samples, annotating 176,856 mutations across 19,439 genes, with 352 full exomes, 43 whole genome rearrangement screens and 4 full genomes now available. The data are updated regularly, with new releases scheduled every two months. COSMIC provides a large number of graphical and tabular views for interpreting and mining the large quantity of information, as well as the facility to export the relevant data in various formats. The website can be navigated in many ways to examine mutation patterns on the basis of genes, samples and phenotypes, which are the main entry points to COSMIC. COSMIC also provides various options to browse the data in a genomic context. Integration with the Ensembl genome browser allows the visualization of full genome annotations, together with COSMIC data, on the GRCh37 genome coordinates. 
COSMIC also contains its own genome browser, which facilitates data analysis by combining genome-wide gene structures and sequences with rearrangement breakpoints, copy number variations and all somatic substitutions, deletions, insertions and complex gene mutations. The main COSMIC website [1] encompasses all of the available data. However, within COSMIC, the Cancer Cell Line Project [3] is a specialized component, which provides details of the genotyping of almost 800 commonly used cancer cell lines, through the set of known cancer genes. Its focus is to identify driver mutations, or those likely to be implicated in the oncogenesis of each tumor. This information forms the basis for integrating COSMIC with the Genomics of Drug Sensitivity in Cancer project [4], which is a joint effort with the Massachusetts General Hospital [5] to screen this panel of cancer cell lines against potential anticancer therapeutic compounds to investigate correlations between somatic mutations and drug sensitivity. Data on somatic mutations in cancer are being produced at a rapidly increasing rate, and the combined analysis of large distributed datasets is becoming ever more difficult. However, COSMIC curates and standardizes this information in a single database, providing user-friendly browsing tools and analytical functions, thus ensuring its role as a key resource in human cancer genetics.

965 citations


Journal ArticleDOI
TL;DR: A strong genetic component to inter-individual variation in DNA methylation profiles is demonstrated, and there was an enrichment of SNPs that affect both methylation and gene expression, providing evidence for shared mechanisms in a fraction of genes.
Abstract: DNA methylation is an essential epigenetic mechanism involved in gene regulation and disease, but little is known about the mechanisms underlying inter-individual variation in methylation profiles. Here we measured methylation levels at 22,290 CpG dinucleotides in lymphoblastoid cell lines from 77 HapMap Yoruba individuals, for which genome-wide gene expression and genotype data were also available. Association analyses of methylation levels with more than three million common single nucleotide polymorphisms (SNPs) identified 180 CpG-sites in 173 genes that were associated with nearby SNPs (putatively in cis, usually within 5 kb) at a false discovery rate of 10%. The most intriguing trans signal was obtained for SNP rs10876043 in the disco-interacting protein 2 homolog B gene (DIP2B, previously postulated to play a role in DNA methylation), that had a genome-wide significant association with the first principal component of patterns of methylation; however, we found only modest signal of trans-acting associations overall. As expected, we found significant negative correlations between promoter methylation and gene expression levels measured by RNA-sequencing across genes. Finally, there was a significant overlap of SNPs that were associated with both methylation and gene expression levels. Our results demonstrate a strong genetic component to inter-individual variation in DNA methylation profiles. Furthermore, there was an enrichment of SNPs that affect both methylation and gene expression, providing evidence for shared mechanisms in a fraction of genes.

761 citations


Journal ArticleDOI
TL;DR: Mutations in several TRP genes have been implicated in diverse pathological states, including neurodegenerative disorders, skeletal dysplasia, kidney disorders and pain, and ongoing research may help find new therapies for treatments of related diseases.
Abstract: The transient receptor potential (TRP) multigene superfamily encodes integral membrane proteins that function as ion channels. Members of this family are conserved in yeast, invertebrates and vertebrates. The TRP family is subdivided into seven subfamilies: TRPC (canonical), TRPV (vanilloid), TRPM (melastatin), TRPP (polycystin), TRPML (mucolipin), TRPA (ankyrin) and TRPN (NOMPC-like); the latter is found only in invertebrates and fish. TRP ion channels are widely expressed in many different tissues and cell types, where they are involved in diverse physiological processes, such as sensation of different stimuli or ion homeostasis. Most TRPs are non-selective cation channels, only few are highly Ca2+ selective, some are even permeable for highly hydrated Mg2+ ions. This channel family shows a variety of gating mechanisms, with modes of activation ranging from ligand binding, voltage and changes in temperature to covalent modifications of nucleophilic residues. Activated TRP channels cause depolarization of the cellular membrane, which in turn activates voltage-dependent ion channels, resulting in a change of intracellular Ca2+ concentration; they serve as gatekeeper for transcellular transport of several cations (such as Ca2+ and Mg2+), and are required for the function of intracellular organelles (such as endosomes and lysosomes). Because of their function as intracellular Ca2+ release channels, they have an important regulatory role in cellular organelles. Mutations in several TRP genes have been implicated in diverse pathological states, including neurodegenerative disorders, skeletal dysplasia, kidney disorders and pain, and ongoing research may help find new therapies for treatments of related diseases.

734 citations


Journal ArticleDOI
TL;DR: TopHat-Fusion is an enhanced version of TopHat, an efficient program that aligns RNA-seq reads without relying on existing annotation and can discover fusion products deriving from known genes, unknown genes and unannotated splice variants of known genes.
Abstract: TopHat-Fusion is an algorithm designed to discover transcripts representing fusion gene products, which result from the breakage and re-joining of two different chromosomes, or from rearrangements within a chromosome. TopHat-Fusion is an enhanced version of TopHat, an efficient program that aligns RNA-seq reads without relying on existing annotation. Because it is independent of gene annotation, TopHat-Fusion can discover fusion products deriving from known genes, unknown genes and unannotated splice variants of known genes. Using RNA-seq data from breast and prostate cancer cell lines, we detected both previously reported and novel fusions with solid supporting evidence. TopHat-Fusion is available at http://tophat-fusion.sourceforge.net/.

731 citations


Journal ArticleDOI
TL;DR: Cistrome is a web-based application, based on the Galaxy open source framework, that has 29 ChIP-chip- and ChIP-seq-specific tools in three major categories, from preliminary peak calling and correlation analyses to downstream genome feature association, gene expression analyses, and motif discovery.
Abstract: The increasing volume of ChIP-chip and ChIP-seq data being generated creates a challenge for standard, integrative and reproducible bioinformatics data analysis platforms. We developed a web-based application called Cistrome, based on the Galaxy open source framework. In addition to the standard Galaxy functions, Cistrome has 29 ChIP-chip- and ChIP-seq-specific tools in three major categories, from preliminary peak calling and correlation analyses to downstream genome feature association, gene expression analyses, and motif discovery. Cistrome is available at http://cistrome.org/ap/.

635 citations


Journal ArticleDOI
TL;DR: An automated, highly scalable method for carrying out the Solution Hybrid Selection capture approach that provides a dramatic increase in scale and throughput of sequence-ready libraries produced is presented.
Abstract: Genome targeting methods enable cost-effective capture of specific subsets of the genome for sequencing. We present here an automated, highly scalable method for carrying out the Solution Hybrid Selection capture approach that provides a dramatic increase in scale and throughput of sequence-ready libraries produced. Significant process improvements and a series of in-process quality control checkpoints are also added. These process improvements can also be used in a manual version of the protocol.

Journal ArticleDOI
TL;DR: In this article, the authors comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases.
Abstract: The generation and analysis of high-throughput sequencing data are becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus, with read lengths of 95 to 150 bases. We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found depending on the GC content range. The errors and biases we report have implications for the use and the interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. Supporting previous recommendations, the strand-specificity provides a criterion to distinguish sequencing errors from low abundance polymorphisms.
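The abstract above stresses quality filtering and the reliability of quality values. As a rough illustration only, the sketch below converts FASTQ quality characters to error probabilities using the standard Phred definition, Q = -10 log10(p), and filters a read by its expected number of errors. The +33 ASCII offset, the function names and the `max_expected_errors` threshold are assumptions made for this example; they are not the filtering criteria used in the paper, and older Illumina pipelines wrote qualities with a +64 offset.

```python
from typing import List


def decode_phred33(quality_string: str) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (assumes a +33 ASCII offset)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), hence p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_errors(quality_string: str) -> float:
    """Sum of per-base error probabilities across the read."""
    return sum(error_probability(q) for q in decode_phred33(quality_string))


def passes_filter(quality_string: str, max_expected_errors: float = 1.0) -> bool:
    """Hypothetical filter: keep reads whose expected error count is at or below the threshold."""
    return expected_errors(quality_string) <= max_expected_errors
```

A filter on expected errors is only one possible criterion; the study combined several criteria that this listing does not detail.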

Journal ArticleDOI
TL;DR: A better understanding of mycoparasitism is offered, and the development of improved biocontrol strains for efficient and environmentally friendly protection of plants is enforced.
Abstract: Mycoparasitism, a lifestyle where one fungus is parasitic on another fungus, has special relevance when the prey is a plant pathogen, providing a strategy for biological control of pests for plant protection. Probably, the most studied biocontrol agents are species of the genus Hypocrea/Trichoderma. Here we report an analysis of the genome sequences of the two biocontrol species Trichoderma atroviride (teleomorph Hypocrea atroviridis) and Trichoderma virens (formerly Gliocladium virens, teleomorph Hypocrea virens), and a comparison with Trichoderma reesei (teleomorph Hypocrea jecorina). These three Trichoderma species display a remarkable conservation of gene order (78 to 96%), and a lack of active mobile elements probably due to repeat-induced point mutation. Several gene families are expanded in the two mycoparasitic species relative to T. reesei or other ascomycetes, and are overrepresented in non-syntenic genome regions. A phylogenetic analysis shows that T. reesei and T. virens are derived relative to T. atroviride. The mycoparasitism-specific genes thus arose in a common Trichoderma ancestor but were subsequently lost in T. reesei. The data offer a better understanding of mycoparasitism, and thus enforce the development of improved biocontrol strains for efficient and environmentally friendly protection of plants.

Journal ArticleDOI
TL;DR: The availability of the Cannabis sativa genome enables the study of a multifunctional plant that occupies a unique role in human culture and will aid the development of therapeutic marijuana strains with tailored cannabinoid profiles and provide a basis for the breeding of hemp with improved agronomic characteristics.
Abstract: Cannabis sativa has been cultivated throughout human history as a source of fiber, oil and food, and for its medicinal and intoxicating properties. Selective breeding has produced cannabis plants for specific uses, including high-potency marijuana strains and hemp cultivars for fiber and seed production. The molecular biology underlying cannabinoid biosynthesis and other traits of interest is largely unexplored. We sequenced genomic DNA and RNA from the marijuana strain Purple Kush using short-read approaches. We report a draft haploid genome sequence of 534 Mb and a transcriptome of 30,000 genes. Comparison of the transcriptome of Purple Kush with that of the hemp cultivar 'Finola' revealed that many genes encoding proteins involved in cannabinoid and precursor pathways are more highly expressed in Purple Kush than in 'Finola'. The exclusive occurrence of Δ9-tetrahydrocannabinolic acid synthase in the Purple Kush transcriptome, and its replacement by cannabidiolic acid synthase in 'Finola', may explain why the psychoactive cannabinoid Δ9-tetrahydrocannabinol (THC) is produced in marijuana but not in hemp. Resequencing the hemp cultivars 'Finola' and 'USO-31' showed little difference in gene copy numbers of cannabinoid pathway enzymes. However, single nucleotide variant analysis uncovered a relatively high level of variation among four cannabis types, and supported a separation of marijuana and hemp. The availability of the Cannabis sativa genome enables the study of a multifunctional plant that occupies a unique role in human culture. Its availability will aid the development of therapeutic marijuana strains with tailored cannabinoid profiles and provide a basis for the breeding of hemp with improved agronomic characteristics.

Journal ArticleDOI
TL;DR: A genome-wide map of 5hmC in human embryonic stem cells is generated by hmeDIP-seq, in which hydroxymethyl-DNA immunoprecipitation is followed by massively parallel sequencing, and it is found that 5hmC is enriched in enhancers as well as in gene bodies, suggesting a potential role for 5hmC in gene regulation.
Abstract: Background: 5-Hydroxymethylcytosine (5hmC) was recently found to be abundantly present in certain cell types, including embryonic stem cells. There is growing evidence that TET proteins, which convert 5-methylcytosine (5mC) to 5hmC, play important biological roles. To further understand the function of 5hmC, an analysis of the genome-wide localization of this mark is required. Results: Here, we have generated a genome-wide map of 5hmC in human embryonic stem cells by hmeDIP-seq, in which hydroxymethyl-DNA immunoprecipitation is followed by massively parallel sequencing. We found that 5hmC is enriched in enhancers as well as in gene bodies, suggesting a potential role for 5hmC in gene regulation. Consistent with localization of 5hmC at enhancers, 5hmC was significantly enriched in histone modifications associated with enhancers, such as H3K4me1 and H3K27ac. 5hmC was also enriched in other protein-DNA interaction sites, such as OCT4 and NANOG binding sites. Furthermore, we found that 5hmC regions tend to have an excess of G over C on one strand of DNA. Conclusions: Our findings suggest that 5hmC may be targeted to certain genomic regions based both on gene expression and sequence composition.

Journal ArticleDOI
TL;DR: Alu elements, primate-specific repeats that comprise 11% of the human genome, have wide-ranging influences on gene expression; their contribution to genome evolution, gene regulation and disease is reviewed.
Abstract: Alu elements are primate-specific repeats and comprise 11% of the human genome. They have wide-ranging influences on gene expression. Their contribution to genome evolution, gene regulation and disease is reviewed.

Journal ArticleDOI
TL;DR: This work reveals how differences in biogenesis, function and evolution contribute to characteristic features of microRNA evolution in the two kingdoms.
Abstract: MicroRNAs are pervasive in both plants and animals, but many aspects of their biogenesis, function and evolution differ. We reveal how these differences contribute to characteristic features of microRNA evolution in the two kingdoms.

Journal ArticleDOI
TL;DR: The evolution of Atg8 proteins is discussed and the current view of their function in intracellular trafficking and autophagy from a structural perspective is summarized.
Abstract: Autophagy-related (Atg) proteins are eukaryotic factors participating in various stages of the autophagic process. Thus far 34 Atgs have been identified in yeast, including the key autophagic protein Atg8. The Atg8 gene family encodes ubiquitin-like proteins that share a similar structure consisting of two amino-terminal α helices and a ubiquitin-like core. Atg8 family members are expressed in various tissues, where they participate in multiple cellular processes, such as intracellular membrane trafficking and autophagy. Their role in autophagy has been intensively studied. Atg8 proteins undergo a unique ubiquitin-like conjugation to phosphatidylethanolamine on the autophagic membrane, a process essential for autophagosome formation. Whereas yeast has a single Atg8 gene, many other eukaryotes contain multiple Atg8 orthologs. Atg8 genes of multicellular animals can be divided, by sequence similarities, into three subfamilies: microtubule-associated protein 1 light chain 3 (MAP1LC3 or LC3), γ-aminobutyric acid receptor-associated protein (GABARAP) and Golgi-associated ATPase enhancer of 16 kDa (GATE-16), which are present in sponges, cnidarians (such as sea anemones, corals and hydras) and bilateral animals. Although genes from all three subfamilies are found in vertebrates, some invertebrate lineages have lost the genes from one or two subfamilies. The amino terminus of Atg8 proteins varies between the subfamilies and has a regulatory role in their various functions. Here we discuss the evolution of Atg8 proteins and summarize the current view of their function in intracellular trafficking and autophagy from a structural perspective.

Journal ArticleDOI
TL;DR: Genome-wide association studies (GWAS) have been even more successful in plants than in humans and mapping approaches can be extended to dissect adaptive genetic variation from structured background variation in an ecological context.
Abstract: Genome-wide association studies (GWAS) have been even more successful in plants than in humans. Mapping approaches can be extended to dissect adaptive genetic variation from structured background variation in an ecological context.

Journal ArticleDOI
TL;DR: The TIMP family is an ancient one, with a single representative in lower eukaryotes and four members in mammals, and recently, non-inhibitory functions of TIMPs have been identified in mammalian cells, including signaling roles downstream of specific receptors.
Abstract: Orchestration of the growth and remodeling of tissues and responses of cells to their extracellular environment is mediated by metalloproteinases of the Metzincin clan. This group of proteins comprises several families of endopeptidases in which a zinc atom is liganded at the catalytic site to three histidine residues and an invariant methionine residue. Tissue inhibitors of metalloproteinases (TIMPs) are endogenous protein regulators of the matrix metalloproteinase (MMPs) family, and also of families such as the disintegrin metalloproteinases (ADAM and ADAMTS). TIMPs therefore have a pivotal role in determining the influence of the extracellular matrix, of cell adhesion molecules, and of many cytokines, chemokines and growth factors on cell phenotype. The TIMP family is an ancient one, with a single representative in lower eukaryotes and four members in mammals. Although much is known about their mechanism of action in proteinase regulation in mammalian cells, less is known about their functions in lower organisms. Recently, non-inhibitory functions of TIMPs have been identified in mammalian cells, including signaling roles downstream of specific receptors. There are clearly still questions to be answered with regard to their overall roles in biology.

Journal ArticleDOI
TL;DR: Genetic studies in several model organisms have helped to unravel a multitude of physiological functions associated with cullin proteins and their respective CRLs, which have an impact on a range of biological processes, including cell growth, development, signal transduction, transcriptional control, genomic integrity and tumor suppression.
Abstract: Cullin proteins are molecular scaffolds that have crucial roles in the post-translational modification of cellular proteins involving ubiquitin. The mammalian cullin protein family comprises eight members (CUL1 to CUL7 and PARC), which are characterized by a cullin homology domain. CUL1 to CUL7 assemble multi-subunit Cullin-RING E3 ubiquitin ligase (CRL) complexes, the largest family of E3 ligases with more than 200 members. Although CUL7 and PARC are present only in chordates, other members of the cullin protein family are found in Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana and yeast. A cullin protein tethers both a substrate-targeting unit, often through an adaptor protein, and the RING finger component in a CRL. The cullin-organized CRL thus positions a substrate close to the RING-bound E2 ubiquitin-conjugating enzyme, which catalyzes the transfer of ubiquitin to the substrate. In addition, conjugation of cullins with the ubiquitin-like molecule Nedd8 modulates activation of the corresponding CRL complex, probably through conformational regulation of the interactions between cullin's carboxy-terminal tail and CRL's RING subunit. Genetic studies in several model organisms have helped to unravel a multitude of physiological functions associated with cullin proteins and their respective CRLs. CRLs target numerous substrates and thus have an impact on a range of biological processes, including cell growth, development, signal transduction, transcriptional control, genomic integrity and tumor suppression. Moreover, mutations in CUL7 and CUL4B genes have been linked to hereditary human diseases.

Journal ArticleDOI
TL;DR: A specific subset of poly(A)- histone mRNAs are identified that are expressed in undifferentiated hESCs and are rapidly diminished upon differentiation; further, these same histone genes are induced upon reprogramming of fibroblasts to induced pluripotent stem cells.
Abstract: RNAs can be physically classified into poly(A)+ or poly(A)- transcripts according to the presence or absence of a poly(A) tail at their 3' ends. Current deep sequencing approaches largely depend on the enrichment of transcripts with a poly(A) tail, and therefore offer little insight into the nature and expression of transcripts that lack poly(A) tails. We have used deep sequencing to explore the repertoire of both poly(A)+ and poly(A)- RNAs from HeLa cells and H9 human embryonic stem cells (hESCs). Using stringent criteria, we found that while the majority of transcripts are poly(A)+, a significant portion of transcripts are either poly(A)- or bimorphic, being found in both the poly(A)+ and poly(A)- populations. Further analyses revealed that many mRNAs may not contain classical long poly(A) tails and such messages are overrepresented in specific functional categories. In addition, we surprisingly found that a few excised introns accumulate in cells and thus constitute a new class of non-polyadenylated long non-coding RNAs. Finally, we have identified a specific subset of poly(A)- histone mRNAs, including two histone H1 variants, that are expressed in undifferentiated hESCs and are rapidly diminished upon differentiation; further, these same histone genes are induced upon reprogramming of fibroblasts to induced pluripotent stem cells. We offer a rich source of data that allows a deeper exploration of the poly(A)- landscape of the eukaryotic transcriptome. The approach we present here also applies to the analysis of the poly(A)- transcriptomes of other organisms.

Journal ArticleDOI
TL;DR: The lamins are the major architectural proteins of the animal cell nucleus; they provide a platform for the binding of proteins and chromatin and confer mechanical stability, and mutations in lamins cause a large number of diverse human diseases, underscoring their functional importance.
Abstract: The lamins are the major architectural proteins of the animal cell nucleus. Lamins line the inside of the nuclear membrane, where they provide a platform for the binding of proteins and chromatin and confer mechanical stability. They have been implicated in a wide range of nuclear functions, including higher-order genome organization, chromatin regulation, transcription, DNA replication and DNA repair. The lamins are members of the intermediate filament (IF) family of proteins, which constitute a major component of the cytoskeleton. Lamins are the only nuclear IFs and are the ancestral founders of the IF protein superfamily. Lamins polymerize into fibers forming a complex protein meshwork in vivo and, like all IF proteins, have a tripartite structure with two globular head and tail domains flanking a central α-helical rod domain, which supports the formation of higher-order polymers. Mutations in lamins cause a large number of diverse human diseases, collectively known as the laminopathies, underscoring their functional importance.

Journal ArticleDOI
TL;DR: Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation', but is this sufficient to ensure cost-effective and efficient 'knowledge generation'?
Abstract: Advances in sequencing technology have led to a sharp decrease in the cost of 'data generation'. But is this sufficient to ensure cost-effective and efficient 'knowledge generation'?

Journal ArticleDOI
TL;DR: It is established that C. militaris is sexually heterothallic but, very unusually, can fruit without an opposite mating-type partner; relative to other fungi, many protein families are reduced in C. militaris, which suggests a more restricted ecology.
Abstract: Species in the ascomycete fungal genus Cordyceps have been proposed to be the teleomorphs of Metarhizium species. The latter have been widely used as insect biocontrol agents. Cordyceps species are highly prized for use in traditional Chinese medicines, but the genes responsible for biosynthesis of bioactive components, insect pathogenicity and the control of sexuality and fruiting have not been determined. Here, we report the genome sequence of the type species Cordyceps militaris. Phylogenomic analysis suggests that different species in the Cordyceps/Metarhizium genera have evolved into insect pathogens independently of each other, and that their similar large secretomes and gene family expansions are due to convergent evolution. However, relative to other fungi, including Metarhizium spp., many protein families are reduced in C. militaris, which suggests a more restricted ecology. Consistent with its long track record of safe usage as a medicine, the Cordyceps genome does not contain genes for known human mycotoxins. We establish that C. militaris is sexually heterothallic but, very unusually, fruiting can occur without an opposite mating-type partner. Transcriptional profiling indicates that fruiting involves induction of the Zn2Cys6-type transcription factors and MAPK pathway; unlike other fungi, however, the PKA pathway is not activated. The data offer a better understanding of Cordyceps biology and will facilitate the exploitation of medicinal compounds produced by the fungus.

Journal ArticleDOI
TL;DR: A novel iterative method is proposed, based on the expectation maximization algorithm, that reconstructs full-length small subunit gene sequences and provides estimates of relative taxon abundances in microbial communities.
Abstract: Recovery of ribosomal small subunit genes by assembly of short read community DNA sequence data generally fails, making taxonomic characterization difficult. Here, we solve this problem with a novel iterative method, based on the expectation maximization algorithm, that reconstructs full-length small subunit gene sequences and provides estimates of relative taxon abundances. We apply the method to natural and simulated microbial communities, and correctly recover community structure from known and previously unreported rRNA gene sequences. An implementation of the method is freely available at https://github.com/csmiller/EMIRGE.
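
The abundance-estimation idea in the abstract above can be illustrated with a generic expectation maximization loop. The sketch below is not EMIRGE's implementation: it assumes a fixed, hypothetical matrix `likelihoods[i][j]` giving the probability of read i under candidate sequence j, and it estimates only relative abundances. The iterative reconstruction of the gene sequences themselves, which the abstract also describes, is left out. The function name `em_abundances` and its parameters are illustrative.

```python
def em_abundances(likelihoods, n_iter=50):
    """Estimate relative abundances of candidate sequences with EM.

    likelihoods[i][j] is an assumed, precomputed P(read i | sequence j).
    Returns one abundance per candidate sequence, summing to 1.
    """
    n_refs = len(likelihoods[0])
    # Start from a uniform prior over candidate sequences.
    priors = [1.0 / n_refs] * n_refs
    for _ in range(n_iter):
        counts = [0.0] * n_refs
        # E-step: split each read across sequences by posterior probability.
        for row in likelihoods:
            weighted = [p * l for p, l in zip(priors, row)]
            total = sum(weighted)
            if total == 0.0:
                continue  # read is incompatible with every candidate
            for j, w in enumerate(weighted):
                counts[j] += w / total
        norm = sum(counts)
        if norm == 0.0:
            break  # no read could be assigned; keep the current priors
        # M-step: abundances are the normalized expected read counts.
        priors = [c / norm for c in counts]
    return priors


# Three reads match only the first sequence, one read only the second.
example = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
abundances = em_abundances(example)
```

With unambiguous reads as in this toy input, the estimate reduces to the fraction of reads assigned to each sequence; the EM iterations matter when a read is compatible with several candidates.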

Journal ArticleDOI
TL;DR: A novel informatics concept of the molecular fragmentation query language implemented within the LipidXplorer open source software kit that supports accurate quantification of individual species of any ionizable lipid class in shotgun spectra acquired on any mass spectrometry platform is presented.
Abstract: Shotgun lipidome profiling relies on direct mass spectrometric analysis of total lipid extracts from cells, tissues or organisms and is a powerful tool to elucidate the molecular composition of lipidomes. We present a novel informatics concept of the molecular fragmentation query language implemented within the LipidXplorer open source software kit that supports accurate quantification of individual species of any ionizable lipid class in shotgun spectra acquired on any mass spectrometry platform.

Journal ArticleDOI
TL;DR: It is shown that PARalyzer delineates sites with a high signal-to-noise ratio, and that motif finding identifies the sequence preferences of RNA-binding proteins, as well as seed-matches for highly expressed microRNAs when profiling Argonaute proteins.
Abstract: Crosslinking and immunoprecipitation (CLIP) protocols have made it possible to identify transcriptome-wide RNA-protein interaction sites. In particular, PAR-CLIP utilizes a photoactivatable nucleoside for more efficient crosslinking. We present an approach, centered on the novel PARalyzer tool, for mapping high-confidence sites from PAR-CLIP deep-sequencing data. We show that PARalyzer delineates sites with a high signal-to-noise ratio. Motif finding identifies the sequence preferences of RNA-binding proteins, as well as seed-matches for highly expressed microRNAs when profiling Argonaute proteins. Our study describes tailored analytical methods and provides guidelines for future efforts to utilize high-throughput sequencing in RNA biology. PARalyzer is available at http://www.genome.duke.edu/labs/ohler/research/PARalyzer/.

Journal ArticleDOI
TL;DR: IsomiRs are found to be biologically relevant and functionally cooperative partners of canonical miRNAs that act coordinately to target pathways of functionally related genes, a finding that helps explain a major miRNA paradox.
Abstract: Variants of microRNAs (miRNAs), called isomiRs, are commonly reported in deep-sequencing studies; however, the functional significance of these variants remains controversial. Observational studies show that isomiR patterns are non-random, hinting that these molecules could be regulated and therefore functional, although no conclusive biological role has been demonstrated for these molecules. To assess the biological relevance of isomiRs, we have performed ultra-deep miRNA-seq on ten adult human tissues, and created an analysis pipeline called miRNA-MATE to align, annotate, and analyze miRNAs and their isomiRs. We find that isomiRs share sequence and expression characteristics with canonical miRNAs, and are generally strongly correlated with canonical miRNA expression. A large proportion of isomiRs potentially derive from AGO2 cleavage independent of Dicer. We isolated polyribosome-associated mRNA, captured the mRNA-bound miRNAs, and found that isomiRs and canonical miRNAs are equally associated with translational machinery. Finally, we transfected cells with biotinylated RNA duplexes encoding isomiRs or their canonical counterparts and directly assayed their mRNA targets. These studies allow us to experimentally determine genome-wide mRNA targets, and these experiments showed substantial overlap in functional mRNA networks suppressed by both canonical miRNAs and their isomiRs. Together, these results find isomiRs to be biologically relevant and functionally cooperative partners of canonical miRNAs that act coordinately to target pathways of functionally related genes. This work exposes the complexity of the miRNA-transcriptome, and helps explain a major miRNA paradox: how specific regulation of biological processes can occur when the specificity of miRNA targeting is mediated by only 6 to 11 nucleotides.

Journal ArticleDOI
TL;DR: RNA interference-mediated knock-down of the VAPB-IKZF3 fusion gene indicated that it may be necessary for cancer cell growth and survival, and a number of novel fusion genes in breast cancer cells are discovered using RNA-sequencing and improved bioinformatic stratification.
Abstract: Until recently, chromosomal translocations and fusion genes have been an underappreciated class of mutations in solid tumors. Next-generation sequencing technologies provide an opportunity for systematic characterization of cancer cell transcriptomes, including the discovery of expressed fusion genes resulting from underlying genomic rearrangements. We applied paired-end RNA-seq to identify 24 novel and 3 previously known fusion genes in breast cancer cells. Supported by an improved bioinformatic approach, we had a 95% success rate of validating gene fusions initially detected by RNA-seq. Fusion partner genes were found to contribute promoters (5' UTR), coding sequences and 3' UTRs. Most fusion genes were associated with copy number transitions and were particularly common in high-level DNA amplifications. This suggests that fusion events may contribute to the selective advantage provided by DNA amplifications and deletions. Some of the fusion partner genes, such as GSDMB in the TATDN1-GSDMB fusion and IKZF3 in the VAPB-IKZF3 fusion, were only detected as a fusion transcript, indicating activation of a dormant gene by the fusion event. A number of fusion gene partners have either been previously observed in oncogenic gene fusions, mostly in leukemias, or otherwise reported to be oncogenic. RNA interference-mediated knock-down of the VAPB-IKZF3 fusion gene indicated that it may be necessary for cancer cell growth and survival. In summary, using RNA-sequencing and improved bioinformatic stratification, we have discovered a number of novel fusion genes in breast cancer, and identified VAPB-IKZF3 as a potential fusion gene with importance for the growth and survival of breast cancer cells.