
Rapid and Efficient Co-Transcriptional Splicing Enhances Mammalian Gene Expression

12 Feb 2020 - bioRxiv (Cold Spring Harbor Laboratory)
TL;DR: A patient-derived mutation in β-globin that causes thalassemia improves splicing efficiency and proper termination, revealing that co-transcriptional splicing efficiency is a determinant of productive gene output.
Abstract: Pre-mRNA splicing is tightly coordinated with transcription in yeasts, and introns can be removed soon after they emerge from RNA polymerase II (Pol II). To determine if splicing is similarly rapid and efficient in mammalian cells, we performed long read sequencing of nascent RNA during mouse erythropoiesis. Remarkably, 50% of splicing occurred while Pol II was within 150 nucleotides of 3′ splice sites. PRO-seq revealed that Pol II does not pause around splice sites, confirming that mammalian and yeast spliceosomes can act equally rapidly. Two exceptions were observed. First, several hundred introns displayed abundant splicing intermediates, suggesting that the spliceosome can stall after the first catalytic step. Second, some genes – notably globins – displayed poor splicing coupled to readthrough transcription. Remarkably, a patient-derived mutation in β-globin that causes thalassemia improves splicing efficiency and proper termination, revealing co-transcriptional splicing efficiency is a determinant of productive gene output.

Summary (6 min read)

INTRODUCTION

  • Transcription and pre-mRNA processing steps – 5′ end capping, splicing, base modification, and 3′ end cleavage – required for eukaryotic gene expression are each carried out by macromolecular machines.
  • Spliceosomes may not assemble on all of the introns at the same time, because promoter-proximal introns are synthesized before promoter-distal introns.
  • Co-transcriptional splicing also demands that the constellation of splicing factors capable of regulating a splicing event bind the nascent RNA coordinately with the timing imposed by transcription and in a relevant spatial window.
  • Those studies did not explore coupling to 3′ end formation.
  • Whether 3′ end cleavage efficiency contributes to gene expression levels in mammalian cells is currently unknown.

PacBio Long-read Sequencing of Nascent RNA Yields High Read Coverage

  • Murine erythroleukemia (MEL) cells are immortalized at the proerythroblast stage and can be induced to enter terminal erythroid differentiation by treatment with 2% DMSO for five days (Antoniou, 1991).
  • Phenotypic changes include decreased cell volume, increased levels of β-globin, and visible hemoglobinization.
  • Chromatin purification under stringent washing conditions allows release of contaminating RNAs and retains the stable ternary complex formed by elongating Pol II, DNA, and nascent RNA.
  • To generate libraries for LRS, the authors established the protocol outlined in Figure 1A.
  • More than 7,500 genes were represented by more than 10 reads per gene in each condition.

LRS Reveals Rapid and Efficient Co-transcriptional Splicing

  • Each long-read provides two critical pieces of information: the 3′ end reveals the position of Pol II when the RNA was isolated; the splice junctions reveal if splicing has occurred and which splice sites were chosen.
  • To validate this finding, the authors examined the read length distribution for reads of each splicing status.
  • One explanation for the relatively short distances observed between splice junctions and Pol II may be that Pol II pauses just downstream of an intron, allowing time for splicing to occur before elongation continues.
  • To control for the possibility that high PRO-seq density from TSS peaks might bleed through to the first 5′SS, first introns were independently analyzed.
  • To determine what features of specific introns might lead to increased splicing intermediates, the authors counted and normalized the number of splicing intermediates observed for each intron.

Unspliced Transcripts Display Poor Cleavage at Gene Ends

  • Consistent with physiological terminal erythroid differentiation, their induced MEL cells shifted to maximal expression of α- and β-globin genes, each containing two introns.
  • To their surprise, a large fraction of individual β-globin long-reads in the induced condition had 3′ ends that were up to 2.5 kb downstream of the annotated polyA site (PAS), indicating that these transcripts failed to undergo 3′ end cleavage at the PAS.
  • Notably, PRO-seq reads are commonly detected well past the gene 3′ ends due to transcription termination (Core et al., 2008).
  • Coverage of all unspliced reads was globally higher in the region downstream of a PAS than it was for partially spliced or all spliced reads.
  • This genome-wide decrease in splicing efficiency associated with impaired 3′ end cleavage confirmed the coordination between splicing and 3′ end processing prominently observed in the globin genes.

A β-thalassemia Mutation Enhances Splicing and 3′ End Cleavage Efficiencies

  • To investigate how mutations in splice sites alter co-transcriptional splicing efficiency, the authors took advantage of a known β-thalassemia allele.
  • This thalassemia-causing mutation, known as IVS-110, generates an HBB mRNA with an in-frame stop codon, resulting in a 90% reduction in functional HBB protein through nonsense-mediated decay (Spritz et al., 1981; Vadolas et al., 2006).
  • To rigorously test the possibility that changes in co-transcriptional splicing efficiency determine 3′ end cleavage, read coverage downstream of the HBB PAS was used to detect uncleaved long-reads for each category of splicing status.
  • All-unspliced HBB reads were detected up to 4 kb past the PAS, similar to endogenous mouse globin genes.
  • When only intron 2 was spliced, cleavage in MEL-HBB WT and MEL-HBB IVS-110(G>A) cells was similar.

DISCUSSION

  • This study reveals functional relationships between co-transcriptional RNA processing events through genome-wide analysis of individual nascent transcripts purified from differentiating mammalian erythroid cells.
  • Thus, spliceosome assembly and the transition to catalysis often occur when the spliceosome is physically close to Pol II.
  • The authors conclude that splicing more typically occurs when Pol II is close to the intron.
  • The authors identified spliced reads within the PRO-seq data, validating the observations made with LRS of purified nascent RNA with an independent method.
  • The fraction of efficiently spliced β-globin transcripts increased in the thalassemia allele the authors studied, even though the cryptic 3′SS yields an out-of-frame mRNA that will, like many thalassemia alleles of β-globin, be degraded by nonsense-mediated decay (Kurosaki et al., 2019).

LIMITATIONS

  • First, the length of long-reads is dependent on reverse transcriptase processivity when copying RNA into cDNA.
  • While the authors have taken steps to enrich for full-length transcripts in their library generation, some RNAs are likely not fully reverse transcribed and captured in this dataset.
  • Second, the authors have not addressed directly what the ultimate fate of unspliced and uncleaved nascent RNA is in these cells.
  • Finally, a more rigorous test of their proposed mechanism linking splicing and 3′ end cleavage would require tools to probe inhibition of both processes.

ACKNOWLEDGMENTS

  • The authors thank P Patsali for sharing the MEL-HBB WT and MEL-HBB IVS-110(G>A) cell lines, M Antoniou for sharing an annotation of the GLOBE vector, and J Conboy for advice on erythroblast fractionation.
  • The authors thank E Brown for help with preparation of LRS figures, J Gordon for technical assistance, and H Tilgner, T Carrocci, D Phizicky, T Alpert, T Henriques, and B Martin for helpful discussions and comments on the manuscript.
  • This work was initiated through pilot funding from NIDDK under Grant U54DK106857 to the Yale Cooperative Center of Excellence in Hematology (to K.M.N.).
  • Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
  • K.A.R. is supported by a Postgraduate Scholarship from the Natural Sciences and Engineering Research Council of Canada and a Gruber Science Fellowship, and C.M. is supported by a National Science Foundation Graduate Research Fellowship (DGE1745303).

DECLARATION OF INTERESTS

  • Solid line represents the mean coverage of three biological replicates, and shaded windows represent standard error of the mean.
  • (B) PRO-seq 3′ end coverage aligned to 5′SSs for all introns from active transcripts (dark purple), first introns only (light purple), middle introns (light orange), and terminal intron (dark orange).

Data and Code Availability

  • Raw and processed long-read sequencing and PRO-seq data generated in this study are deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO Series accession number GSE144205.
  • Raw image data associated with this manuscript are available on Mendeley (http://dx.doi.org/10.17632/5vrtbpnj4k.1).
  • All code supporting the long-read sequencing data analysis in this manuscript is available at https://github.com/NeugebauerLab/MEL_LRS.
  • NanoCOP data from Drexler et al. 2020 analyzed in this manuscript can be found at GEO with accession number GSE123191, and total RNA-seq from MEL cells analyzed in this study can be found at Mouse ENCODE (http://www.mouseencode.org/) with accession number ENCSR000CWE.

Subcellular Fractionation

  • Subcellular fractionation was adapted from previously published protocols (Mayer and Churchman, 2017; Pandya-Jones and Black, 2009), with modifications to centrifugation speeds in order to retain intact nuclei (Reimer and Neugebauer, 2020).
  • All steps were performed on ice, and all buffers contained 25 μM α-amanitin, 40 U/ml SUPERase.IN, and 1x Roche cOmplete protease inhibitor mix.
  • The supernatant (cytoplasm fraction) was removed, and the pellet was rinsed once with 500 μl PBS/1 mM EDTA.
  • Chromatin was immediately dissolved in 100 μl PBS and 300 μl TRIzol Reagent.

Nascent RNA Isolation

  • RNA was purified from chromatin pellets in TRIzol Reagent using the RNeasy Mini kit according to the manufacturer’s protocol, including the on-column DNase I digestion.
  • For genome-wide nascent RNA-seq, samples were depleted three times of polyA(+) RNA using the Dynabeads mRNA DIRECT Micro Purification Kit, each time keeping the supernatant, then depleted of ribosomal RNA using the Ribo-Zero Gold rRNA Removal Kit.
  • For targeted nascent RNA-seq, polyA(+) and rRNA depletion were omitted.

Western Blotting

  • Cytoplasm, nucleoplasm, and chromatin fractions from cell fractionation were adjusted to an equal volume with PBS.
  • Nucleoplasm and chromatin fractions were homogenized by sonication, and all samples were spun at 14,000 rpm for 10 min at 4°C before gel loading.
  • For primers used to amplify Hbb-b1 and Gapdh, see Table S2.
  • QPCR reactions were assembled using iQ SYBR Green Supermix and quantified on a Stratagene MX3000P qPCR machine.
  • Expression fold changes were calculated using the ΔΔCt method.
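
The ΔΔCt calculation mentioned above can be illustrated with a minimal sketch. It assumes the standard form of the method (fold change = 2^-ΔΔCt, which presumes roughly 100% amplification efficiency) and assumes that Gapdh served as the reference gene for Hbb-b1; the source lists primers for both genes but does not state the normalization explicitly. The function name and the Ct values in the usage note are hypothetical.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the delta-delta-Ct method.

    Assumes (not stated in the source) that the reference gene is stable
    across conditions and that amplification efficiency is close to 100%.
    """
    # Delta-Ct: target Ct normalized to the reference gene in each condition.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # Delta-delta-Ct: treated relative to control.
    ddct = delta_treated - delta_control
    return 2 ** (-ddct)
```

For example, with hypothetical Ct values of 20 (target) and 18 (reference) in induced cells versus 25 and 18 in uninduced cells, ΔΔCt is -5 and the function returns a 32-fold increase.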

Microscopy

  • Live cells were imaged in bright field on an Olympus CKX41 microscope.
  • For total RNA samples, RNA was extracted from approximately 5 million cells treated with Pladienolide B as described above and using TRIzol Reagent according to the manufacturer’s protocol.
  • RNA was further depleted from this sample as described above.
  • PCR was performed using Phusion High-Fidelity DNA Polymerase (NEB) according to the manufacturer’s protocol.
  • For the list of intron-flanking primers used in these experiments, see Table S2.

Genome-wide nascent RNA sequencing

  • Mapped reads in SAM format were filtered to remove reads that contained a polyA tail using a custom script (available on Github).
  • Briefly, mapped reads that had soft-clipped bases at the 3′ end were discarded if the soft-clipped region of the read contained 4 or more A’s and the fraction of A’s was greater than 0.9.
  • Similarly, reads with soft-clipped bases at the 5′ end (resulting from minus strand reads) containing at least 4 T’s and having a fraction of T’s greater than 0.9 were discarded.
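
The two soft-clip rules above can be sketched as a small predicate. This is an illustration only, not the authors' script (which is available in their repository); the function names are hypothetical, and the sketch assumes the soft-clipped sequences at each end of the aligned read have already been extracted as plain strings.

```python
def is_polya_clip(clipped_seq, base="A", min_count=4, min_fraction=0.9):
    """True if a soft-clipped sequence looks like a non-templated homopolymer tail.

    Thresholds follow the text: at least 4 matching bases and a matching
    fraction strictly greater than 0.9.
    """
    if not clipped_seq:
        return False
    count = clipped_seq.upper().count(base)
    return count >= min_count and count / len(clipped_seq) > min_fraction


def should_discard(clip_5p, clip_3p):
    """Discard a read with an A-rich clip at the 3' end of the alignment,
    or a T-rich clip at the 5' end (the minus-strand case)."""
    return is_polya_clip(clip_3p, "A") or is_polya_clip(clip_5p, "T")
```

Note that a clip of nine A's followed by one C has an A fraction of exactly 0.9 and is therefore kept under the strict "greater than" reading used here.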

HBB targeted nascent RNA sequencing

  • Additional parameters were added to the above criteria for removing polyA-containing reads from targeted data mapped to the HBB locus based on empirical observation.
  • Since the HBB locus is integrated randomly in the MEL genome, long uncleaved transcripts that have coverage past the annotated HBB locus read into random genomic regions and cause long stretches of mismatched soft-clipped bases.
  • A custom script was used to filter polyA-containing reads but retain uncleaved transcripts (available on Github).
  • Uncleaved reads with long stretches of soft-clipped bases that passed this filtering were then recoded to contain a match in the CIGAR string downstream of the PAS in order to include these regions of the long-reads in coverage calculations.

PRO-seq Data Preprocessing

  • Cutadapt was used to trim paired-end reads to 40 nt, removing adapter sequence and low quality 3′ ends, and discarding reads that were shorter than 20 nt (-m20 -q 1).
  • Trimmed paired-end reads were first mapped to the Drosophila dm3 reference genome using Bowtie, and subsequent uniquely mapped reads to the dm3 genome were used to determine percent spike-in return across all samples.
  • Paired-end reads that failed to align to the dm3 genome were mapped to the mm10 reference genome.
  • Due to the “forward/reverse” orientation of Illumina paired-end sequencing, “+” and “-” stranded bedGraph files were switched at the end of the pipeline (Mahat et al., 2016).
  • Since the spike-in return was comparable between biological replicates within a treatment type, and no comparisons were made between the two treatment conditions, no further normalizations were performed.

PRO-seq and total RNA-seq Data Analysis

  • A list of active transcripts in MEL cells was first generated using PRO-seq signal within a 300 nt window around annotated TSSs in the GENCODE mm10 vM20 annotation.
  • Additionally, if two intron annotations shared a 5′SS or 3′SS, the annotation with the most spliced reads was kept.
  • Violin plots evaluating PRO-seq 3′ end or RNA-seq read coverage were generated by summing the signal at the indicated positions with respect to the 5′SS, 3′SS or PAS.
  • P-values were calculated using either the Mann-Whitney or the Wilcoxon matched-pairs signed rank test.
  • Resulting reads were filtered to discard reads with an “N” size > 10,000 using pysam, to remove poorly mapped reads or reads mapped across very large introns.
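
The “N” size filter mentioned above can be sketched as follows. The source used pysam; this sketch swaps in a plain regular expression over the CIGAR string so that it runs without extra dependencies. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def max_skip_length(cigar):
    """Length of the largest N (skipped region) operation, or 0 if there is none."""
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"),
               default=0)


def keep_read(cigar, max_n=10_000):
    """Keep a read unless any skipped region is longer than max_n."""
    return max_skip_length(cigar) <= max_n
```

With these definitions, a read aligned as `20M500N20M` is kept, while `20M10001N20M` is discarded.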

Splicing Status Classification and Co-transcriptional Splicing Efficiency (CoSE) Calculation

  • The annotation of introns contained in active transcripts (described above for PRO-seq), was first filtered for unique intron start and end coordinates.
  • If the junction was not present in the read, a 10 nt window was included in the search for the junction to allow for slight mismatches in alignments.
  • If the junction was not found, the intron was classified as unspliced.
  • To classify splicing status of each read, the number of spliced introns was compared to the total number of introns that was overlapped.
  • Introns with identical 5′SS or 3′SS were filtered to keep only the intron with the most total reads.
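
The per-read classification described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and the 10 nt search window is interpreted here as a tolerance of up to 10 nt on each junction coordinate, which the source does not spell out.

```python
def intron_is_spliced(read_junctions, intron, tolerance=10):
    """True if any junction in the read matches the intron coordinates.

    read_junctions: list of (start, end) pairs found in the read.
    intron: annotated (start, end) pair.
    tolerance: assumed per-coordinate slack for slight alignment mismatches.
    """
    start, end = intron
    return any(abs(js - start) <= tolerance and abs(je - end) <= tolerance
               for js, je in read_junctions)


def classify_read(n_spliced, n_overlapped):
    """Compare spliced introns with the total number of introns a read overlaps."""
    if n_overlapped == 0:
        return "NA"
    if n_spliced == n_overlapped:
        return "all spliced"
    if n_spliced == 0:
        return "all unspliced"
    return "partially spliced"
```

A read that overlaps three annotated introns and carries junctions for two of them would be labeled "partially spliced".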

Distance from Splice Junction to 3′ End Calculation

  • Splicing intermediates (defined below), were filtered out from the long-read data in this analysis, since their 3′ ends do not represent the position of Pol II, but rather an upstream exon between step I and step II of splicing.
  • For all remaining reads, data were filtered for reads that contained at least 1 splice junction, and then the last “block size”, which represents the distance from the most distal splice junction to the 3′ end of the read, was calculated.
  • Coordinates of the last spliced intron were also recorded, and each intron was matched to a transcript and categorized by gene biotype using mygene in python.
  • To determine if certain genes exhibited a longer or shorter distance from 3′ end to Pol II, the distance was split into three equal size categories and transcript IDs from each category were entered into the online PANTHER classification system: no significant enrichment was obtained.
  • Introns considered in this analysis were the same set of introns considered for CoSE as described above.
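
The “last block size” measurement can be sketched as below, assuming each read is available as a list of aligned block sizes in genomic order (as in the blockSizes field of a BED12 record). The strand handling is an assumption of this sketch: for a minus-strand read, the block nearest the RNA 3′ end is the first one in genomic order. The function name is hypothetical.

```python
def junction_to_3p_end(block_sizes, strand):
    """Distance from the most distal splice junction to the read's 3' end.

    block_sizes: aligned block lengths in genomic (left-to-right) order.
    strand: "+" or "-".
    Returns None for reads without a splice junction (a single block).
    """
    if len(block_sizes) < 2:
        return None
    # The 3'-most block is last on the plus strand and first on the minus strand.
    return block_sizes[-1] if strand == "+" else block_sizes[0]
```

For a plus-strand read with blocks of 100, 50, and 30 nt, the distance from the last junction to the 3′ end (taken as the Pol II position) is 30 nt.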

Long-read Coverage

  • Transcript coordinates associated with active TSSs (as described above) were obtained from UCSC.
  • Transcripts were then grouped by the parent Gene ID, and the largest range of start and end coordinates from the grouped transcripts was kept.
  • Library depth was then calculated using bedtools coverage across this file of collapsed active gene coordinates.
  • For coverage downstream of the PAS, long-reads were separated by splicing status (see below), then coverage was calculated using bedtools within a window around PASs that corresponded to active TSSs or specifically to a window around the HBB PAS.
  • Coverage at all positions was normalized to the coverage at the position 100 nt upstream of the PAS.
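
The normalization step above can be sketched as follows, assuming per-base coverage over the window is held in a plain list and that the index of the PAS within that list is known. The function and parameter names are hypothetical, and the coverage itself would come from bedtools as described in the text.

```python
def normalize_to_upstream(coverage, pas_index, offset=100):
    """Divide every position by the coverage `offset` nt upstream of the PAS.

    coverage: per-base coverage across the window, in transcript orientation.
    pas_index: index of the PAS within `coverage`.
    """
    ref_index = pas_index - offset
    if ref_index < 0:
        raise ValueError("window does not extend far enough upstream of the PAS")
    reference = coverage[ref_index]
    if reference == 0:
        raise ValueError("no coverage at the reference position")
    return [c / reference for c in coverage]
```

After this step the reference position has a value of 1.0, so downstream values read directly as the fraction of transcripts that extend past the PAS relative to the gene body.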

Uncleaved Transcripts Analysis

  • Bedtools intersect was used to identify long-reads with 5′ ends originating in a gene body of active transcripts (as described above).
  • Reads were then categorized as being uncleaved transcripts if their 3′ ends were greater than 50 nt downstream of the PAS of the gene which the 5′ end overlapped with.
  • Splicing status classification of uncleaved transcripts was carried out as described above.
  • For long-reads derived from MEL-HBB IVS-110(G>A) cells, only reads that were spliced at intron 1 using the cryptic splice site were analyzed, and the rare reads with a splice junction using the canonical splice site were discarded.
  • Splicing status classification, counting of splicing intermediates, and calculating coverage downstream of the PAS were performed as described above but with the custom HBB annotation coordinates.
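
The 50 nt cutoff used above to call a read uncleaved can be sketched as a strand-aware comparison. The function name is hypothetical, and coordinates are assumed to be single genomic positions for the read 3′ end and the PAS.

```python
def is_uncleaved(read_3p_end, pas, strand, min_distance=50):
    """True if the read's 3' end lies more than min_distance nt downstream of the PAS."""
    # "Downstream" means increasing coordinates on "+" and decreasing on "-".
    downstream = read_3p_end - pas if strand == "+" else pas - read_3p_end
    return downstream > min_distance
```

A plus-strand read ending at position 1,051 with a PAS at 1,000 would be called uncleaved, while one ending at 1,050 would not, because the text specifies "greater than 50 nt".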

QUANTIFICATION AND STATISTICAL ANALYSIS

  • All information about statistical testing for individual experiments can be found in figure legends, including statistical tests used, number of replicates, and number of observations.
  • Sample | Sequencing Protocol | Raw read number | Mapped read number | PolyA-filtered read number
    MEL_LRS_uninduced | PacBio LRS | 583,632 | 545,477 | 538,452
    MEL_LRS_induced | PacBio LRS
  • RT primer for targeted first strand synthesis, barcode 1: AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCGGAT
    RT-PCR primer F, C1qbp: GACGTGTGCTCTTCCGATCTCACAGATTCCCTGGACTGG

KEY RESOURCES TABLE

  • Antibodies
    Rabbit polyclonal anti-GAPDH | Santa Cruz Biotechnology | FL-335/sc-25778
    Mouse monoclonal anti-Pol II | Santa Cruz Biotechnology | CTD4H8/sc-47701
    Bacterial and Virus Strains
    Biological Samples
    Chemicals, Peptides, and Recombinant Proteins
    DMEM + GlutaMAX | Gibco | 10569-010
    Fetal Bovine Serum (FBS) | Gibco | 16000-044
    Penicillin Streptomycin | Gibco | 15140-122
    α-Amanitin | Sigma | A2263
    SUPERase.
  • This paper N/A Recombinant DNA Software and Algorithms Porechop v0.2.4 N/A.


Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis
Kirsten A. Reimer1, Claudia Mimoso2, Karen Adelman2, and Karla M. Neugebauer1*
1 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA
2 Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
*Correspondence: karla.neugebauer@yale.edu
ABSTRACT
Pre-mRNA processing steps are tightly coordinated with transcription in many organisms. To determine
how co-transcriptional splicing is integrated with transcription elongation and 3′ end formation in
mammalian cells, we performed long-read sequencing of individual nascent RNAs and PRO-seq during
mouse erythropoiesis. Splicing was not accompanied by transcriptional pausing and was detected when
RNA polymerase II (Pol II) was within 75–300 nucleotides of 3′ splice sites (3′SSs), often during
transcription of the downstream exon. Interestingly, several hundred introns displayed abundant splicing
intermediates, suggesting that splicing delays can take place between the two catalytic steps. Overall,
splicing efficiencies were correlated among introns within the same transcript, and intron retention was
associated with inefficient 3′ end cleavage. Remarkably, a thalassemia patient-derived mutation
introducing a cryptic 3′SS improves both splicing and 3′ end cleavage of individual β-globin transcripts,
demonstrating functional coupling between the two co-transcriptional processes as a determinant of
productive gene output.
Keywords: nascent RNA, erythropoiesis, globin, co-transcriptional splicing, PacBio, long read sequencing
bioRxiv preprint doi: https://doi.org/10.1101/2020.02.11.944595; this version posted December 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

INTRODUCTION
Transcription and pre-mRNA processing steps – 5′ end capping, splicing, base modification, and 3′ end
cleavage – required for eukaryotic gene expression are each carried out by macromolecular machines.
The spliceosome assembles de novo on each intron, recognizing the 5′ and 3′ splice sites (SSs) that
demarcate intron boundaries and then catalyzing two transesterification reactions to excise introns and
ligate exons together (Wilkinson et al., 2020). In mammalian cells, genes typically encode pre-mRNAs
containing 8-10 introns of variable lengths, creating a high cellular demand for spliceosomes relative to
all of the other machineries, which only act once per transcript. Splicing is also a highly-regulated process;
it is influenced by environmental factors, developmental cues, and factors in the local pre-messenger
RNA (pre-mRNA) environment, such as RNA secondary structure and RNA-binding protein occupancy
(Baralle and Giudice, 2017; Jeong, 2017; Lin et al., 2016; Pai and Luca, 2019). The influence of trans-
acting factors on the selection of 5′ and 3′SSs is thought to explain how constitutive and alternative splice
sites are chosen. These working models still largely rely on in vitro biochemistry and often do not explain
changes in alternative splicing or overall gene expression observed upon experimental perturbation or
disease-associated mutations of splicing factors (Joshi et al., 2017; Manning and Cooper, 2017). Thus,
despite detailed knowledge of modulatory factors, the mechanisms underlying the gene regulatory
potential of pre-mRNA splicing are not fully understood in vivo.
Across species, tissues, and cell types, splicing occurs during pre-mRNA synthesis by Pol II (Custodio
and Carmo-Fonseca, 2016; Neugebauer, 2019). Thus, spliceosome assembly occurs as the nascent
RNA is growing longer and more diverse in sequence and structure. Spliceosomes may not assemble on
all of the introns at the same time, because promoter-proximal introns are synthesized before promoter-
distal introns. The questions of whether introns are spliced in the order they are transcribed and how
splicing of individual introns within a given transcript might be coordinated are currently the subject of
intense investigation. Co-transcriptional splicing also demands that the constellation of splicing factors
capable of regulating a splicing event bind the nascent RNA coordinately with the timing imposed by
transcription and in a relevant spatial window. For example, a splicing inhibitor element in a given nascent
RNA would only be influential if it were transcribed before the target intron was removed.
Recently, the Neugebauer lab has used single-molecule sequencing approaches to determine how
splicing progresses as a function of transcription in budding and fission yeasts, where introns are
removed shortly after synthesis (Alpert et al., 2020; Carrillo Oesterreich et al., 2016; Herzel et al., 2018).
The approaches mark the nascent RNA’s 3′ end, which is present in the catalytic center of Pol II, to
determine the position of Pol II when splicing occurs and define the sequence of the pre-mRNA substrate
acted on by the spliceosome. These data show that only a small portion of the downstream exon may be
needed for 3′SS identification and splicing. Interestingly, altering the rate of Pol II elongation affects
splicing outcomes, including widespread changes in alternative splicing (Aslanzadeh et al., 2018; Braberg
et al., 2013; Carrillo Oesterreich et al., 2016; de la Mata et al., 2003; Fong et al., 2014; Ip et al., 2011;
Jonkers and Lis, 2015; Schor et al., 2013). Taken together, these findings suggest that transcription
elongation rate may govern the amount of downstream RNA available for cis regulation at the time that
splicing takes place. This in turn would determine which trans-acting regulatory factors could be recruited
to the nascent RNA to modulate splicing. To obtain mechanistic insights into these processes, we need
to understand how mammalian cells with many more introns per gene and vastly increased levels of
alternative splicing compared to yeast coordinate co-transcriptional splicing with transcription
elongation.
Another issue raised by co-transcriptional RNA processing is how splicing is coordinated with other pre-
mRNA processing steps (Bentley, 2014; Herzel et al., 2017). In recent long-read sequencing studies in
budding and fission yeasts (Alpert et al., 2020; Herzel et al., 2018), “all or none” splicing of individual
nascent transcripts was discovered, suggesting positive and negative cooperativity among neighboring
introns and polyA cleavage sites. Indeed, crosstalk among introns was observed in human cells at the
same time by others (Kim et al., 2017; Tilgner et al., 2018). However, those studies did not explore
coupling to 3′ end formation. Cleavage of the nascent RNA by the cleavage and polyadenylation
machinery at polyA sites (PAS) releases the RNA from Pol II and the RNA is subsequently polyadenylated
(Kumar et al., 2019). Coupling between splicing and 3′ end cleavage is important, because uncleaved
transcripts are degraded by the nuclear exosome in S. pombe (Herzel et al., 2018; Meola et al., 2016;
Zhou et al., 2015). Whether 3′ end cleavage efficiency contributes to gene expression levels in
mammalian cells is currently unknown.
Here we report our analysis of nascent RNA transcription and splicing in murine erythroleukemia (MEL)
cells undergoing erythroid differentiation, a developmental program that exhibits well-known, drastic
changes in gene expression (An et al., 2014; Reimer and Neugebauer, 2018). We have employed two
single-molecule sequencing approaches to directly measure co-transcriptional splicing of nascent RNA:
(i) Long-read sequencing (LRS), which enables genome-wide analysis of splicing with respect to Pol II
position and (ii) Precision Run-On sequencing (PRO-seq), enabling the assessment of Pol II density at
these sites. We rigorously determine the spatial window in which co-transcriptional splicing occurs and
define co-transcriptional splicing efficiency for thousands of mouse introns, Pol II elongation behavior
across splice junctions, and the effects of efficient co-transcriptional splicing on 3′ end cleavage. These
findings identify the pre-mRNA substrates of splicing and show that splicing of multiple introns within
individual transcripts is coordinated with 3′ end cleavage. In particular, the demonstration of highly
efficient splicing in the absence of transcriptional pausing causes us to rethink key features of splicing
regulation in mammalian cells.
RESULTS
PacBio Long-read Sequencing of Nascent RNA Yields High Read Coverage
Murine erythroleukemia (MEL) cells are immortalized at the proerythroblast stage and can be induced to
enter terminal erythroid differentiation by treatment with 2% DMSO for five days (Antoniou, 1991).
Phenotypic changes include decreased cell volume, increased levels of β-globin, and visible
hemoglobinization (Figures S1A-C). We used chromatin purification of uninduced and induced MEL cells
to enrich for nascent RNA (Figure 1A). Chromatin purification under stringent washing conditions allows
release of contaminating RNAs and retains the stable ternary complex formed by elongating Pol II, DNA,
and nascent RNA (Figure S1D; Wuarin and Schibler, 1994). Importantly, spliceosome assembly does
not continue during chromatin fractionation or RNA isolation, because the presence of the splicing
inhibitor Pladienolide B throughout the purification process does not change splicing levels (Figure S2).
To generate libraries for LRS, we established the protocol outlined in Figure 1A. Two biological
replicates, each with two technical replicates, were sequenced using PacBio RSII and Sequel flow cells,
yielding a total of 1,155,629 mappable reads (Table S1). Reads containing a non-templated polyA tail
comprised only 1.7% of the total reads (Table S1) and were removed bioinformatically along with
abundant 7SK RNA reads. Of the remaining reads, the average read length was 710 and 733 nucleotides
(nt), and the average coverage in reads per gene was 8.4 and 4.8 for uninduced and induced samples,
respectively (Figure 1B-C). More than 7,500 genes were represented by more than 10 reads per gene
in each condition (Figure 1C). Coverage of 5′ ends was focused at annotated transcription start sites
(TSSs), with 18.3% of 5′ ends within 50 bp of an active TSS across all samples. As expected, 3′ end
coverage was distributed more evenly throughout gene bodies, with an increase just upstream of
annotated transcription end sites (TESs) and a drop after TESs (Figure S1E).
LRS Reveals Rapid and Efficient Co-transcriptional Splicing
Each long-read provides two critical pieces of information: the 3′ end reveals the position of Pol II when
the RNA was isolated; the splice junctions reveal if splicing has occurred and which splice sites were
chosen. Here, we present our LRS data in a format that highlights 3′ end position and the associated
splicing status (Figure 2A&B; Figure S3A). Each transcript was categorized and colored according to
its splicing status, which can be either “all spliced”, “partially spliced”, “all unspliced”, or “NA” (transcripts
that did not span an entire intron or a 3′SS). For each gene, we calculated the fraction of long-reads that
were all spliced, partially spliced, or all unspliced (Figure 2A; bar plot far right), enabling a survey of
splicing behaviors within individual transcripts (Alpert et al., 2020; Herzel et al., 2018; Kim et al., 2017).
Splicing status of individual transcripts varied from gene to gene. For example, the gene Actb had mostly
all spliced reads (78% and 75% of reads in uninduced and induced cells respectively), while Calr and
Eif1 had a greater fraction of all unspliced reads (Figure 2B). Genome-wide, the majority of long-reads
were all spliced (Figure 2C; 68.0% and 73.8% for uninduced and induced cells, respectively), with an
average of 88% of all introns being spliced. Therefore, the majority of introns are removed co-
transcriptionally. To validate this finding, we examined the read length distribution for reads of each
splicing status (Figure S3B). As expected, partially spliced and all unspliced reads were longer than all
spliced reads due to the presence of introns, suggesting that the efficient shortening of nascent RNA due
to splicing limits the lengths of long-reads.
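The per-read categories described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `splicing_status`, the 0-based half-open coordinate convention, and the simplification that "NA" means "spans no entire annotated intron" are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical convention: (start, end) in 0-based, half-open genomic coordinates.
Interval = Tuple[int, int]


def splicing_status(
    read_start: int,
    read_end: int,
    read_junctions: Set[Interval],
    annotated_introns: Iterable[Interval],
) -> str:
    """Classify one long read by the introns it spans entirely.

    read_junctions holds the introns that appear as splice junctions
    (alignment gaps) in this read; annotated_introns are the introns
    of the gene the read maps to.
    """
    # Only introns fully contained in the read are informative.
    spanned = [
        intron
        for intron in annotated_introns
        if read_start <= intron[0] and intron[1] <= read_end
    ]
    if not spanned:
        return "NA"
    n_spliced = sum(1 for intron in spanned if intron in read_junctions)
    if n_spliced == len(spanned):
        return "all spliced"
    if n_spliced == 0:
        return "all unspliced"
    return "partially spliced"
```

Counting the returned labels over all reads of a gene and dividing by the number of non-"NA" reads would give per-gene fractions of the kind shown in the bar plots; whether "NA" reads belong in the denominator is a choice the text does not specify.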
To quantify co-transcriptional splicing for each intron detected by at least 10 long-reads, we defined a
metric termed the Co-transcriptional Splicing Efficiency (CoSE), tabulated as the number of spliced reads
that span the intron divided by the total number of reads (spliced + unspliced) that span the intron (Figure
2D). A higher CoSE value indicates a higher fraction of co-transcriptional splicing. To validate this metric,
we analyzed an independently generated total RNA-seq dataset in uninduced MEL cells (downloaded
from ENCODE; Davis et al., 2018). Although nascent RNA is rare in total RNA, the density of reads
mapping to a given intron is expected to be inversely proportional to splicing efficiency. The ratio of
intron-mapping reads to flanking exon-mapping reads was calculated for each intron and compared
to CoSE levels. As expected, higher CoSE corresponded to lower relative intron coverage in the total
RNA-seq data (Figure S3C). Thus, this independent data set validates the CoSE metric. CoSE values
also remained stable across all levels of read coverage (Figure S3D).
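As a worked illustration of the CoSE definition, the following Python sketch computes the metric from per-intron read counts. The function name `cose` and the choice to return `None` for introns below the 10-read threshold are assumptions for illustration, not part of the published analysis code.

```python
from typing import Optional


def cose(spliced: int, unspliced: int, min_reads: int = 10) -> Optional[float]:
    """Co-transcriptional Splicing Efficiency for one intron.

    spliced / (spliced + unspliced), counting only reads that span the
    intron. Introns covered by fewer than min_reads spanning reads are
    skipped (None), mirroring the 10-read threshold stated in the text.
    """
    total = spliced + unspliced
    if total < min_reads:
        return None
    return spliced / total
```

For example, an intron spanned by 9 spliced and 1 unspliced read has a CoSE of 0.9, while an intron spanned by only 5 reads in total is left without a value.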
To determine if intron splicing events are coordinated within the same transcript, we asked how similar
CoSE values were between introns in the same transcript. To do so, transcripts containing at least 3
introns with recorded CoSE values (n = 2,028) were compiled. We found that the variance in CoSE
between introns within the same transcript was significantly smaller than the variance in CoSE for the
same number of randomly assorted introns (Figure 2E); these differences persisted when we analyzed
transcripts containing 3, 4, or 5 introns supported by long-reads (Figure S3E). Taken together, these
results suggest that most introns are well-spliced co-transcriptionally, and that splicing is coordinated in
mammalian multi-intron transcripts expressed by both uninduced and induced MEL cells.
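The comparison between within-transcript variance and randomly assorted introns can be sketched as follows. This is only one plausible reading of the procedure: the helper names, the use of the sample variance, and the shuffling scheme (pooling all CoSE values and redistributing them into groups of the original sizes) are assumptions, and the statistical test applied to the two distributions is not specified in the text, so none is shown.

```python
import random
import statistics
from typing import Dict, List


def observed_variances(
    cose_by_transcript: Dict[str, List[float]], min_introns: int = 3
) -> List[float]:
    """Sample variance of CoSE across introns, one value per transcript."""
    return [
        statistics.variance(values)
        for values in cose_by_transcript.values()
        if len(values) >= min_introns
    ]


def shuffled_variances(
    cose_by_transcript: Dict[str, List[float]],
    min_introns: int = 3,
    seed: int = 0,
) -> List[float]:
    """Variances after randomly reassigning introns to transcripts.

    Group sizes are preserved, so each shuffled group has the same
    number of introns as the transcript it replaces.
    """
    groups = [v for v in cose_by_transcript.values() if len(v) >= min_introns]
    pool = [value for group in groups for value in group]
    random.Random(seed).shuffle(pool)
    variances = []
    position = 0
    for group in groups:
        chunk = pool[position : position + len(group)]
        variances.append(statistics.variance(chunk))
        position += len(group)
    return variances
```

If splicing is coordinated within transcripts, the observed variances should tend to be smaller than the shuffled ones; repeating the shuffle with different seeds gives a sense of the spread of the null distribution.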
The frequency of all-spliced nascent transcripts implies that splicing in mammalian cells is rapid enough
to match the rate of transcription. A direct way to address this is to measure the position of Pol II on
nascent RNA when ligated exons are observed. Observing Pol II downstream of a spliced junction
indicates that the active spliceosome has assembled and catalyzed splicing in the time it took for Pol II
to translocate the measured distance. Therefore, we determined the distance in nucleotides between the
3′ end of each read and the nearest spliced exon-exon junction (Figure 3A). To eliminate 3′ ends that
arise from splicing intermediates and not from active transcription, reads with 3′ ends mapping precisely
to the last nt of exons were removed from this analysis. Although the longest distances between splice
junctions and elongating Pol II were just over 6 kb, these were rare. Instead, 75% of splice junctions were
within ~300 nt of a 3′ end, and the median distance was 154 nt in uninduced cells and 128 nt in induced
cells (Figure 3B). Therefore, changes in the gene expression program during erythropoiesis did not alter
the dynamic relationship between transcription and splicing. Consistent with this, CoSE values were
similar when comparing induced to uninduced cells (Figure 3C; Spearman’s rho = 0.56). In fact, only 66
introns with improved splicing and 42 introns with reduced splicing displayed a >2-fold change in CoSE
upon induction. Taken together, this analysis shows that although global changes in gene expression
take place between these two timepoints, the relationship between transcription and splicing remains the
same. Overall, these two measurements do not support major changes in splicing efficiency during
erythroid differentiation. Moreover, the distance from Pol II to the nearest splice junction was independent
of GO category or intron length (Figure S4B; GO analysis not shown). Because median exon size in the
mouse genome is 151 nt (Waterston et al., 2002), our data indicate that active spliceosomes can be fully
assembled and functional when Pol II is within or just downstream of the next transcribed exon. Recent
direct sequencing of nascent RNA seemed to reveal less rapid splicing (Drexler et al., 2020). However,
when we analyzed this dataset in the same manner as our own, the cumulative distance from Pol II to
the nearest splice junction is similarly close across organisms and cell types (median distance in human
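The distance measurement used in this section (3′ end of a read to its nearest spliced junction, excluding putative splicing intermediates) can be sketched in Python as below. The function name, the 0-based half-open coordinates, and the way strand is handled are assumptions for illustration and are not taken from the authors' code.

```python
from typing import Iterable, Optional, Set, Tuple

# Hypothetical convention: (start, end) in 0-based, half-open genomic coordinates.
Interval = Tuple[int, int]


def distance_to_nearest_junction(
    read_start: int,
    read_end: int,
    strand: str,
    junctions: Set[Interval],
    annotated_introns: Iterable[Interval],
) -> Optional[int]:
    """Nucleotides between a read's 3' end and its nearest spliced junction.

    Returns None for reads without a spliced junction and for reads whose
    3' end sits exactly on the last nt of an exon (likely first-step
    splicing intermediates rather than Pol II positions).
    """
    if not junctions:
        return None
    introns = list(annotated_introns)
    if strand == "+":
        # The 3' end is the right edge; an exon ends where an intron starts.
        if any(read_end == start for start, _ in introns):
            return None
        return read_end - max(end for _, end in junctions)
    # Minus strand: the 3' end is the left edge of the alignment.
    if any(read_start == end for _, end in introns):
        return None
    return min(start for start, _ in junctions) - read_start
```

Collecting these distances over all reads and passing them to `statistics.median` would give a summary value comparable in kind to the medians reported above.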
Citations
Journal ArticleDOI
TL;DR: The current principles of splicing regulation are summarized, including the impact of cis and trans regulatory elements, as well as the influence of chromatin structure, transcription, and RNA modifications.

47 citations

Journal ArticleDOI
TL;DR: The emerging molecular picture highlights how, compared to its yeast counterpart, the human spliceosome has coopted additional protein factors to allow increased plasticity of splice site recognition and remodeling, and potentially to regulate alternative splicing.

20 citations


Cites background from "Rapid and Efficient Co-Transcriptio..."

  • ...However, in mammals, the U2 and U1 snRNPs from adjacent introns are thought to interact across an exon in a process called exon definition [17], which has been proposed to control alternative splicing decisions [18], though some genome-wide studies suggest exon definition may only affect a subpopulation of introns [19,20]....

  • ...com of the 3′SS through interactions with the polypyrimidine tract (Figure 3h) and a recent genome-wide study implicates the polypyrimidine tract in controlling exon ligation of a subset of human introns [20]....

Journal ArticleDOI
TL;DR: Xist requires sequence elements beyond its first two kilobases to robustly silence transcription, and the 5′ end of Xist harbors SPEN-independent transcriptional antiterminator activity that can repress proximal cleavage and polyadenylation.
Abstract: The Xist lncRNA requires Repeat A, a conserved RNA element located in its 5' end, to induce gene silencing during X-chromosome inactivation. Intriguingly, Repeat A is also required for production of Xist. While silencing by Repeat A requires the protein SPEN, how Repeat A promotes Xist production remains unclear. We report that in mouse embryonic stem cells, expression of a transgene comprising the first two kilobases of Xist (Xist-2kb) causes transcriptional readthrough of downstream polyadenylation sequences. Readthrough required Repeat A and the ∼750 nucleotides downstream, did not require SPEN, and was attenuated by splicing. Despite associating with SPEN and chromatin, Xist-2kb did not robustly silence transcription, whereas a 5.5-kb Xist transgene robustly silenced transcription and read through its polyadenylation sequence. Longer, spliced Xist transgenes also induced robust silencing yet terminated efficiently. Thus, in contexts examined here, Xist requires sequence elements beyond its first two kilobases to robustly silence transcription, and the 5' end of Xist harbors SPEN-independent transcriptional antiterminator activity that can repress proximal cleavage and polyadenylation. In endogenous contexts, this antiterminator activity may help produce full-length Xist RNA while rendering the Xist locus resistant to silencing by the same repressive complexes that the lncRNA recruits to other genes.

9 citations

Posted ContentDOI
14 Dec 2020-bioRxiv
TL;DR: A two-pass approach, combining alignment metrics and machine-learning-derived sequence information to filter spurious examples from splice junctions identified in long-read alignments, improves the accuracy of spliced alignment and transcriptome annotation without requiring orthogonal information from short read RNAseq or existing annotations.
Abstract: Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long-reads reveals the true complexity of processing, however the relatively high error rates of long-read technologies can reduce the accuracy of intron identification. Here we present a two-pass approach, combining alignment metrics and machine-learning-derived sequence information to filter spurious examples from splice junctions identified in long-read alignments. The remaining junctions are then used to guide realignment. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome annotation without requiring orthogonal information from short read RNAseq or existing annotations.

5 citations

Journal ArticleDOI
01 Dec 2020
TL;DR: How to isolate nascent RNA from mammalian cells through subcellular fractionation of chromatin‐associated RNA, as well as how to deplete poly(A)+ RNA and rRNA, and how to generate a full‐length cDNA library for use on long read sequencing platforms is described.
Abstract: Long read sequencing technologies now allow high-quality sequencing of RNAs (or their cDNAs) that are hundreds to thousands of nucleotides long. Long read sequences of nascent RNA provide single-nucleotide-resolution information about co-transcriptional RNA processing events-e.g., splicing, folding, and base modifications. Here, we describe how to isolate nascent RNA from mammalian cells through subcellular fractionation of chromatin-associated RNA, as well as how to deplete poly(A)+ RNA and rRNA, and, finally, how to generate a full-length cDNA library for use on long read sequencing platforms. This approach allows for an understanding of coordinated splicing status across multi-intron transcripts by revealing patterns of splicing or other RNA processing events that cannot be gained from traditional short read RNA sequencing. © 2020 Wiley Periodicals LLC. Basic Protocol 1: Subcellular fractionation Basic Protocol 2: Nascent RNA isolation and adapter ligation Basic Protocol 3: cDNA amplicon preparation.

4 citations

Frequently Asked Questions (2)
Q1. What are the contributions in "Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis"?

To determine how co-transcriptional splicing is integrated with transcription elongation and 3′ end formation in mammalian cells, the authors performed long-read sequencing of individual nascent RNAs and PRO-seq during mouse erythropoiesis. Interestingly, several hundred introns displayed abundant splicing intermediates, suggesting that splicing delays can take place between the two catalytic steps. 

Future studies of these enigmatic new players may reveal a role for 3′SS diversity in the regulation of splicing by stalling between catalytic steps. Investigation of these mechanisms awaits future studies that would afford single transcript evaluation of the residence time of intron-bound inhibitory factors (e.g. U1 snRNP) coupled with splicing and cleavage outcome. Less efficient splicing can inhibit 3′ end cleavage (Cooke et al., 1999; Davidson and West, 2013; Martins et al., 2011), suggesting that introns retained in transcripts that display readthrough harbor an inhibitory activity that represses 3′ end cleavage (Figure 7E). The authors speculate that this inhibitory activity persists longer on inefficiently spliced transcripts, potentially binding and inactivating 3′ end cleavage factors (Deng et al., 2020; So et al., 2019).