Dissertation

G-quadruplexes and gene expression in Arabidopsis thaliana

19 Mar 2019
TL;DR: A novel method for identifying PG4s is introduced, which uses a machine learning approach trained on datasets derived from the high-throughput sequencing of G4 structures; it is applied to study the prevalence of PG4s in the genome of the model plant Arabidopsis thaliana.
Abstract: G-Quadruplexes (G4s) are four-stranded DNA structures which form in regions with high GC content and high GC skew. Because G4 structure depends on specific sequences, it is possible to predict putative G4s (PG4s) throughout genomic sequence. PG4s are non-uniformly distributed in genomes, with higher densities within various genic features, particularly promoters, 5’ untranslated regions (UTRs) and coding sequences (CDSs). When they form G4s, these sequences can have a variety of implications for biological processes including replication, transcription, translation and splicing. Here, we introduce a novel method for identifying PG4s, which uses a machine learning approach trained on datasets derived from the high-throughput sequencing of G4 structures. We apply this and other techniques to study the prevalence of PG4s in the genome of the model plant Arabidopsis thaliana. Finally, we study the effect of G4 stabilisation on gene expression in Arabidopsis, using the G-Quadruplex binding agent N-methyl mesoporphyrin (NMM). We identify a family of genes which are strongly downregulated by NMM, and find that they contain large numbers of PG4s in their CDSs.
Citations
01 Jan 2011
TL;DR: The sheer volume and scope of genome-wide data pose a significant challenge to the development of efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data.
Abstract: Rapid improvements in sequencing and array-based platforms are resulting in a flood of diverse genome-wide data, including data from exome and whole-genome sequencing, epigenetic surveys, expression profiling of coding and noncoding RNAs, single nucleotide polymorphism (SNP) and copy number profiling, and functional assays. Analysis of these large, diverse data sets holds the promise of a more comprehensive understanding of the genome and its relation to human disease. Experienced and knowledgeable human review is an essential component of this process, complementing computational approaches. This calls for efficient and intuitive visualization tools able to scale to very large data sets and to flexibly integrate multiple data types, including clinical data. However, the sheer volume and scope of data pose a significant challenge to the development of such tools.

2,187 citations

01 Nov 2017
TL;DR: ChromHMM combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type, and provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state.
Abstract: Noncoding DNA regions have central roles in human biology, evolution, and disease. ChromHMM helps to annotate the noncoding genome using epigenomic information across one or multiple cell types. It combines multiple genome-wide epigenomic maps, and uses combinatorial and spatial mark patterns to infer a complete annotation for each cell type. ChromHMM learns chromatin-state signatures using a multivariate hidden Markov model (HMM) that explicitly models the combinatorial presence or absence of each mark. ChromHMM uses these signatures to generate a genome-wide annotation for each cell type by calculating the most probable state for each genomic segment. ChromHMM provides an automated enrichment analysis of the resulting annotations to facilitate the functional interpretations of each chromatin state. ChromHMM is distinguished by its modeling emphasis on combinations of marks, its tight integration with downstream functional enrichment analyses, its speed, and its ease of use. Chromatin states are learned, annotations are produced, and enrichments are computed within 1 d.

364 citations

Journal Article
TL;DR: Over the years, programming languages have grown more powerful, but correspondingly more complex; and while that complexity is fine and appropriate for professional programmers, it hinders and discourages beginning Computer Science students.
Abstract: Over the years, programming languages have grown more powerful, but correspondingly more complex; and while that complexity is fine and appropriate for professional programmers, it hinders and discourages beginning Computer Science students.

249 citations

Journal Article
TL;DR: In this paper, the authors present a summary of issues that faculty members should review as they begin to consider retirement, including the benefits they consider to be important and the issues that need to be considered.
Abstract: beloved and consummate University citizen who contributed to the well being of the institution and his fellow faculty in innumerable ways. He spearheaded the creation of Hitchhiker, led its compilation and subsequent annual updates, and edited the first three editions. He will be missed. This guide was prepared by members of the Penn Association of Senior and Emeritus Faculty as a summary of issues that faculty members should review as they begin to consider retirement. It is not intended to be a detailed description of available benefits, nor is it intended to replace any of the official documents published by the University of Pennsylvania. The guide was not prepared by and is not published by the University of Pennsylvania, its Division of Human Resources, or any University benefits administrators. The University therefore makes no representations or assurances regarding its accuracy or completeness. Faculty are strongly encouraged to review in detail any summary plan description of benefits they consider to be important, as well as to speak to representatives from the Division of Human Resources before making any decision regarding retirement or benefits. The University offers many benefits to active and retired faculty, the terms of which are set forth in various plans and summary plan descriptions, which may be subject to change.

105 citations

Journal Article

19 citations

References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.

47,974 citations


"G-quadruplexes and gene expression ..." refers methods in this paper

  • ...Receiver Operator Characteristic (ROC, false positive rate plotted against true positive rate) and Precision Recall (PR, precision plotted against recall) curves were generated using scikit-learn and plotted with matplotlib (Hunter, 2007; Pedregosa et al., 2011)....
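A minimal, dependency-free sketch of what ROC and PR curves compute is given here for illustration. It is not the dissertation's code: the excerpt says the curves were generated with scikit-learn (whose `sklearn.metrics.roc_curve` and `precision_recall_curve` serve this purpose), whereas the helper names `sweep`, `roc_points`, `pr_points` and `trapezoid_area` and the toy labels and scores below are all assumptions made for this example.

```python
def sweep(labels, scores):
    """Yield cumulative (tp, fp) counts after each distinct score, high to low."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # Emit a point only once every item tied at this score has been counted.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            yield tp, fp


def roc_points(labels, scores):
    """Return (fpr, tpr) lists, starting from the origin."""
    pos = sum(labels)
    neg = len(labels) - pos
    fpr, tpr = [0.0], [0.0]
    for tp, fp in sweep(labels, scores):
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def pr_points(labels, scores):
    """Return (recall, precision) lists, one point per distinct threshold."""
    pos = sum(labels)
    recall, precision = [], []
    for tp, fp in sweep(labels, scores):
        recall.append(tp / pos)
        precision.append(tp / (tp + fp))
    return recall, precision


def trapezoid_area(x, y):
    """Area under a piecewise-linear curve by the trapezoid rule."""
    return sum((x[k + 1] - x[k]) * (y[k + 1] + y[k]) / 2 for k in range(len(x) - 1))


# Toy data (hypothetical): 1 = structure forms a G4, 0 = it does not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(labels, scores)
recall, precision = pr_points(labels, scores)
roc_auc = trapezoid_area(fpr, tpr)
print(f"ROC AUC = {roc_auc:.3f}")
```

The resulting lists can be passed directly to a plotting call such as matplotlib's `plot(fpr, tpr)` and `plot(recall, precision)`.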


Journal Article
TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.
Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations


"G-quadruplexes and gene expression ..." refers background in this paper

  • ...1 (Love et al., 2014) and log2 transformed to get log counts per million (logCPM)....
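As a rough illustration of the log2 transformation mentioned in the excerpt above, the sketch below converts raw read counts to log counts per million. The excerpt is truncated, so it does not say which normalisation preceded the transformation or how zero counts were handled; the pseudocount of 1, the function name `log_cpm` and the gene names are assumptions made for this example only.

```python
import math


def log_cpm(counts, pseudocount=1.0):
    """Convert raw counts per gene to log2 counts per million.

    counts: dict mapping gene name -> raw read count for one sample.
    pseudocount: added to CPM before taking log2 so that zero counts
                 stay finite (an assumption, not taken from the source).
    """
    library_size = sum(counts.values())
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in counts.items()
    }


# Hypothetical counts for a single sample.
sample = {"geneA": 500, "geneB": 1500, "geneC": 0}
result = log_cpm(sample)
for gene, value in result.items():
    print(f"{gene}\t{value:.2f}")
```

Scaling by library size makes samples of different sequencing depth comparable, and the log transform compresses the large dynamic range of count data.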


Journal Article
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations

Journal Article
TL;DR: The Spliced Transcripts Alignment to a Reference (STAR) software, based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure, outperforms other aligners by a factor of >50 in mapping speed.
Abstract: Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. Availability and implementation: STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.

30,684 citations


"G-quadruplexes and gene expression ..." refers background or methods in this paper

  • ...Acronyms and Abbreviations (Acronym: Definition)
    AGI: Arabidopsis genome initiative
    AID: Activation-induced cytidine deaminase
    AS: Alternative splicing
    ATAC-seq: Assay for transposase-accessible chromatin using sequencing
    ATP: Adenosine triphosphate
    ATR-X/ATRX: Alpha-thalassemia mental retardation syndrome/protein
    AUC: Area under the curve
    BAM: Binary alignment map format
    BED: Browser extensible data format
    BG4: G-Quadruplex binding antibody
    BLAT: BLAT
    CD: Circular dichroism
    cDNA: Complementary DNA
    CDS: Coding sequence
    ChIP-seq: Chromatin immunoprecipitation with sequencing
    CHX: Cycloheximide
    CPM: Counts per million
    CSR: Class switch recombination
    CTD: Carboxy-terminal domain
    DMS: Dimethyl sulphate
    DMSO: Dimethyl sulphoxide
    DNA: Deoxyribonucleic acid
    DNase: Deoxyribonuclease
    DPE: Downstream promoter element
    DSB: Double strand break
    dsDNA: Double stranded DNA
    EMSA: Electrophoretic mobility shift assay
    ENA: European nucleotide archive
    ENCODE: Encyclopedia of DNA elements
    EXT: Extensin
    FDR: False discovery rate
    FISH: Fluorescent in situ hybridisation
    FPR: False positive rate
    FRET: Förster resonance energy transfer
    FTP: File transfer protocol
    G4: G-Quadruplex
    GC content/skew: Guanine and cytosine content/skew
    GEM: Genome multitool
    GEO: Gene expression omnibus
    GO: Gene ontology
    GRO-seq: Global run on sequencing
    GTF/GFF: Gene transfer format/General feature format
    gtRNAdb: Genomic tRNA database
    HDAC: Histone deacetylase
    HDF5: Hierarchical data format 5
    HEK: Human embryonic kidney
    hnRNP: Heterogeneous ribonucleoprotein particle
    IgH: Immunoglobulin heavy chain
    IGV: Integrative genomics viewer
    IRES: Internal ribosome entry site
    lncRNA: Long non-coding RNA
    logCPM: Log counts per million
    logFC: Log fold change
    LRX: Leucine rich repeat extensin
    LSTM: Long short term memory
    LTR: Long terminal repeat
    MA plot: Mean average plot
    miRNA: MicroRNA
    MLP: Multi-layer perceptron
    mRNA: Messenger RNA
    MS: Murashige & Skoog media
    MWU: Mann Whitney U test
    ncRNA: Non-coding RNA
    NER: Nucleotide excision repair
    NHEIII: Nuclease hypersensitive element III
    NMM: N-Methyl Mesoporphyrin IX
    NMR: Nuclear magnetic resonance
    NOESY: Nuclear Overhauser effect spectroscopy
    PAS: Polyadenylation site
    PCR: Polymerase chain reaction
    PCS: Potential coding sequence
    PDB: Protein data bank
    PDS: Pyridostatin
    PG4: Putative G-Quadruplex
    PIC: Preinitiation complex
    Pol II: RNA polymerase II
    PPLR: Probability of positive log ratio
    PR: Precision recall
    pre-mRNA: Precursor mRNA
    qPCR: Quantitative polymerase chain reaction
    RMA: Robust multichip average
    RNA: Ribonucleic acid
    RNase: Ribonuclease
    RNA-seq: RNA sequencing
    ROC: Receiver operator characteristic
    RT: Reverse transcriptase
    SCE: Sister chromatid exchange
    SELEX: Systematic evolution of ligands by exponential enrichment
    SHAPE: Selective 2’-hydroxyl acylation analyzed by primer extension
    smFRET: Single molecule Förster resonance energy transfer
    snRNA: Small nuclear RNA
    SOM: Self organising map
    SP3/4/5: Serine proline 3/4/5 motif
    ssDNA: Single stranded DNA
    STAR: Spliced transcript alignment to a reference
    SWI/SNF: Switch/Sucrose non-fermentable
    tAI: tRNA adaptation index
    TAIR: The Arabidopsis information resource
    TE: Transposable element
    TFIIH: Transcription initiation factor complex
    TPR: True positive rate
    tRNA: Transfer RNA
    TSS: Transcriptional start site
    TTS: Transcriptional termination site
    UMAP: Uniform manifold approximation and projection
    UTR: Untranslated region
    UV: Ultraviolet
    Chapter 1 Introduction...


  • ...Since the RNAseq dataset is unstranded, Cufflinks requires the upstream mapping tool (here, STAR (Dobin et al., 2013)) to annotate the orientation of spliced reads using the intron motif (i....


  • ...Mapping parameters for STAR (Dobin et al., 2013) were made more stringent than defaults in an attempt to increase the precision of mapping over Extensin genes without attenuating recall of splice junctions too strongly....


  • ...2a (Dobin et al., 2013) with default parameters, and generated BAM files were sorted...


  • ...These in frame splice junctions could therefore simply be the result of mapping errors from the spliced aligner STAR (Dobin et al., 2013), which utilises heuristics which may result in some reads from contiguous parts of the genome being mapped as spliced....
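The excerpts above mention that, for unstranded RNA-seq data, STAR is asked to annotate the orientation of spliced reads from the intron motif. The sketch below shows one way such a command might be assembled in Python, using STAR's `--outSAMstrandField intronMotif` option. It only builds the argument list and does not run STAR. The index directory, FASTQ file names, output prefix, thread count and the helper name `build_star_command` are hypothetical, and the more stringent mapping parameters referred to in the excerpts are not shown there, so they are not reproduced here.

```python
import shlex


def build_star_command(genome_dir, reads, out_prefix, threads=4):
    """Assemble (but do not run) a STAR alignment command as an argument list."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # directory holding a prebuilt STAR index
        "--readFilesIn", *reads,            # one file (single-end) or two (paired-end)
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "Unsorted",  # write BAM; sorting can be done afterwards
        # Add the XS strand attribute to spliced reads based on the intron motif,
        # which downstream tools such as Cufflinks expect for unstranded data.
        "--outSAMstrandField", "intronMotif",
    ]


# Hypothetical paths, for illustration only.
cmd = build_star_command(
    "star_index/", ["sample_R1.fastq", "sample_R2.fastq"], "sample_"
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to `subprocess.run`.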


Journal Article
TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations