
Rapid and Efficient Co-Transcriptional Splicing Enhances Mammalian Gene Expression

Kirsten A. Reimer¹, Claudia A. Mimoso², Karen Adelman², Karla M. Neugebauer¹

12 Feb 2020, bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: A patient-derived mutation in β-globin that causes thalassemia improves splicing efficiency and proper termination, revealing that co-transcriptional splicing efficiency is a determinant of productive gene output.


Abstract: Pre-mRNA splicing is tightly coordinated with transcription in yeasts, and introns can be removed soon after they emerge from RNA polymerase II (Pol II). To determine if splicing is similarly rapid and efficient in mammalian cells, we performed long read sequencing of nascent RNA during mouse erythropoiesis. Remarkably, 50% of splicing occurred while Pol II was within 150 nucleotides of 3′ splice sites. PRO-seq revealed that Pol II does not pause around splice sites, confirming that mammalian and yeast spliceosomes can act equally rapidly. Two exceptions were observed. First, several hundred introns displayed abundant splicing intermediates, suggesting that the spliceosome can stall after the first catalytic step. Second, some genes – notably globins – displayed poor splicing coupled to readthrough transcription. Remarkably, a patient-derived mutation in β-globin that causes thalassemia improves splicing efficiency and proper termination, revealing co-transcriptional splicing efficiency is a determinant of productive gene output.


Summary (6 min read)


INTRODUCTION

Transcription and pre-mRNA processing steps – 5′ end capping, splicing, base modification, and 3′ end cleavage – required for eukaryotic gene expression are each carried out by macromolecular machines.
Spliceosomes may not assemble on all of the introns at the same time, because promoter-proximal introns are synthesized before promoter-distal introns.
Co-transcriptional splicing also demands that the constellation of splicing factors capable of regulating a splicing event bind the nascent RNA coordinately with the timing imposed by transcription and in a relevant spatial window.
Those studies did not explore coupling to 3′ end formation.
Whether 3′ end cleavage efficiency contributes to gene expression levels in mammalian cells is currently unknown.

PacBio Long-read Sequencing of Nascent RNA Yields High Read Coverage

Murine erythroleukemia (MEL) cells are immortalized at the proerythroblast stage and can be induced to enter terminal erythroid differentiation by treatment with 2% DMSO for five days (Antoniou, 1991).
Phenotypic changes include decreased cell volume, increased levels of β-globin, and visible hemoglobinization.
Chromatin purification under stringent washing conditions allows release of contaminating RNAs and retains the stable ternary complex formed by elongating Pol II, DNA, and nascent RNA.
To generate libraries for LRS, the authors established the protocol outlined in Figure 1A.
More than 7,500 genes were represented by more than 10 reads per gene in each condition.

LRS Reveals Rapid and Efficient Co-transcriptional Splicing

Each long-read provides two critical pieces of information: the 3′ end reveals the position of Pol II when the RNA was isolated; the splice junctions reveal if splicing has occurred and which splice sites were chosen.
To validate this finding, the authors examined the read length distribution for reads of each splicing status.
One explanation for the relatively short distances observed between splice junctions and Pol II may be that Pol II pauses just downstream of an intron, allowing time for splicing to occur before elongation continues.
To control for the possibility that high PRO-seq density from TSS peaks might bleed through to the first 5′SS, first introns were independently analyzed.
To determine what features of specific introns might lead to increased splicing intermediates, the authors counted and normalized the number of splicing intermediates observed for each intron.

Unspliced Transcripts Display Poor Cleavage at Gene Ends

Consistent with physiological terminal erythroid differentiation, the induced MEL cells shifted to maximal expression of α- and β-globin genes, each containing two introns.
To their surprise, a large fraction of individual β-globin long-reads in the induced condition had 3′ ends that were up to 2.5 kb downstream of the annotated polyA site (PAS), indicating that these transcripts failed to undergo 3′ end cleavage at the PAS.
Notably, PRO-seq reads are commonly detected well past the gene 3′ ends due to transcription termination (Core et al., 2008).
Coverage of all unspliced reads was globally higher in the region downstream of a PAS than it was for partially spliced or all spliced reads.
This genome-wide decrease in splicing efficiency associated with impaired 3′ end cleavage confirmed the coordination between splicing and 3′ end processing prominently observed in the globin genes.

A β-thalassemia Mutation Enhances Splicing and 3′ End Cleavage Efficiencies

To investigate how mutations in splice sites alter co-transcriptional splicing efficiency, the authors took advantage of a known β-thalassemia allele.
This thalassemia-causing mutation, known as IVS-110, generates an HBB mRNA with an in-frame stop codon, resulting in a 90% reduction in functional HBB protein through nonsense-mediated decay (Spritz et al., 1981; Vadolas et al., 2006).
To rigorously test the possibility that changes in co-transcriptional splicing efficiency determine 3′ end cleavage, read coverage downstream of the HBB PAS was used to detect uncleaved long-reads for each category of splicing status.
All-unspliced HBB reads were detected up to 4 kb past the PAS, similar to endogenous mouse globin genes.
When only intron 2 was spliced, cleavage in MEL-HBB WT and MEL-HBB IVS-110(G>A) cells was similar.

DISCUSSION

This study reveals functional relationships between co-transcriptional RNA processing events through genome-wide analysis of individual nascent transcripts purified from differentiating mammalian erythroid cells.
Thus, spliceosome assembly and the transition to catalysis often occur when the spliceosome is physically close to Pol II.
The authors conclude that splicing more typically occurs when Pol II is close to the intron.
The authors identified spliced reads within the PRO-seq data, validating the observations made with LRS of purified nascent RNA with an independent method.
The fraction of efficiently spliced β-globin transcripts increased in the thalassemia allele the authors studied, even though the cryptic 3′SS yields an out-of-frame mRNA that will, like many thalassemia alleles of β-globin, be degraded by nonsense-mediated decay (Kurosaki et al., 2019).

LIMITATIONS

First, the length of long-reads is dependent on reverse transcriptase processivity when copying RNA into cDNA.
While the authors have taken steps to enrich for full-length transcripts in their library generation, some RNAs are likely not fully reverse transcribed and captured in this dataset.
Second, the authors have not addressed directly what the ultimate fate of unspliced and uncleaved nascent RNA is in these cells.
Finally, a more rigorous test of their proposed mechanism linking splicing and 3′ end cleavage would require tools to probe inhibition of both processes.

ACKNOWLEDGMENTS

The authors thank P Patsali for sharing the MEL-HBB WT and MEL-HBB IVS-110(G>A) cell lines, M Antoniou for sharing an annotation of the GLOBE vector, and J Conboy for advice on erythroblast fractionation.
The authors thank E Brown for help with preparation of LRS figures, J Gordon for technical assistance, and H Tilgner, T Carrocci, D Phizicky, T Alpert, T Henriques, and B Martin for helpful discussions and comments on the manuscript.
This work was initiated through pilot funding from NIDDK under Grant U54DK106857 to the Yale Cooperative Center of Excellence in Hematology (to K.M.N.).
Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
K.A.R. is supported by a Postgraduate Scholarship from the Natural Sciences and Engineering Research Council of Canada and a Gruber Science Fellowship, and C.M. is supported by a National Science Foundation Graduate Research Fellowship (DGE1745303).

DECLARATION OF INTERESTS

Solid line represents the mean coverage of three biological replicates, and shaded windows represent standard error of the mean.
(B) PRO-seq 3′ end coverage aligned to 5′SSs for all introns from active transcripts (dark purple), first introns only (light purple), middle introns (light orange), and terminal intron (dark orange).

Data and Code Availability

Raw and processed long-read sequencing and PRO-seq data generated in this study are deposited in NCBI’s Gene Expression Omnibus and are accessible through GEO Series accession number GSE144205.
Raw image data associated with this manuscript are available on Mendeley (http://dx.doi.org/10.17632/5vrtbpnj4k.1).
All code supporting the long-read sequencing data analysis in this manuscript is available at https://github.com/NeugebauerLab/MEL_LRS.
NanoCOP data from Drexler et al. 2020 analyzed in this manuscript can be found at GEO with accession number GSE123191, and total RNA-seq from MEL cells analyzed in this study can be found at Mouse ENCODE (http://www.mouseencode.org/) with accession number ENCSR000CWE.

Subcellular Fractionation

Subcellular fractionation was adapted from previously published protocols (Mayer and Churchman, 2017; Pandya-Jones and Black, 2009), with modifications to centrifugation speeds in order to retain intact nuclei (Reimer and Neugebauer, 2020).
All steps were performed on ice, and all buffers contained 25 μM α-amanitin, 40 U/ml SUPERase.IN, and 1x Roche cOmplete protease inhibitor mix.
The supernatant (cytoplasm fraction) was removed, and the pellet was rinsed once with 500 μl PBS/1 mM EDTA.
Chromatin was immediately dissolved in 100 μl PBS and 300 μl TRIzol Reagent.

Nascent RNA Isolation

RNA was purified from chromatin pellets in TRIzol Reagent using the RNeasy Mini kit according to the manufacturer’s protocol, including the on-column DNase I digestion.
For genome-wide nascent RNA-seq, samples were depleted three times of polyA(+) RNA using the Dynabeads mRNA DIRECT Micro Purification Kit, each time keeping the supernatant, then depleted of ribosomal RNA using the Ribo-Zero Gold rRNA Removal Kit.
For targeted nascent RNA-seq, polyA(+) and rRNA depletion were omitted.

Western Blotting

Cytoplasm, nucleoplasm, and chromatin fractions from cell fractionation were adjusted to an equal volume with PBS.
Nucleoplasm and chromatin fractions were homogenized by sonication, and all samples were spun at 14,000 rpm for 10 min at 4°C before gel loading.
For primers used to amplify Hbb-b1 and Gapdh, see Table S2.
QPCR reactions were assembled using iQ SYBR Green Supermix and quantified on a Stratagene MX3000P qPCR machine.
Expression fold changes were calculated using the ΔΔCt method.
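The ΔΔCt calculation mentioned above can be sketched as follows. This is a minimal illustration, assuming Hbb-b1 as the target and Gapdh as the reference (the two primer sets named above) and assuming roughly 100% amplification efficiency, so that one cycle corresponds to a two-fold change. The function name and the Ct values in the usage line are hypothetical and are not taken from the study.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the delta-delta-Ct method.

    Assumes ~100% primer efficiency (each cycle doubles the product).
    """
    # Normalize the target gene to the reference gene within each condition.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # Compare the two conditions.
    ddct = delta_treated - delta_control
    # A lower Ct means more template, hence the negative exponent.
    return 2.0 ** (-ddct)


# Hypothetical Ct values: target drops from 25 to 20 cycles, reference is unchanged.
print(fold_change_ddct(20.0, 18.0, 25.0, 18.0))
```

If primer efficiencies differ noticeably from 100%, an efficiency-corrected formula would be more appropriate than this sketch.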

Microscopy

Live cells were imaged in bright field on an Olympus CKX41 microscope.
For total RNA samples, RNA was extracted from approximately 5 million cells treated with Pladienolide B as described above and using TRIzol Reagent according to the manufacturer’s protocol.
RNA was further depleted from this sample as described above.
PCR was performed using Phusion High-Fidelity DNA Polymerase (NEB) according to the manufacturer’s protocol.
For the list of intron-flanking primers used in these experiments, see Table S2.

Genome-wide nascent RNA sequencing

Mapped reads in SAM format were filtered to remove reads that contained a polyA tail using a custom script (available on Github).
Briefly, mapped reads that had soft-clipped bases at the 3′ end were discarded if the soft-clipped region of the read contained 4 or more A’s and the fraction of A’s was greater than 0.9.
Similarly, reads with soft-clipped bases at the 5′ end (resulting from minus strand reads) containing at least 4 T’s and having a fraction of T’s greater than 0.9 were discarded.
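The filtering rule described above can be sketched in a few lines. This is not the authors' custom script (which is available from their GitHub repository); it is a minimal illustration that assumes the soft-clipped sequences at each end of the alignment have already been extracted as strings. The function names are hypothetical, and the thresholds are the ones stated in the text.

```python
def _enriched(seq, base, min_count=4, min_frac=0.9):
    """True if `seq` has at least `min_count` copies of `base`
    and the fraction of `base` is strictly greater than `min_frac`."""
    if not seq:
        return False
    n = seq.upper().count(base)
    return n >= min_count and n / len(seq) > min_frac


def has_polya_tail(clip_5p, clip_3p):
    """Flag a mapped read as polyA-containing.

    clip_3p: soft-clipped bases at the 3' end of the alignment (A-rich tail).
    clip_5p: soft-clipped bases at the 5' end (T-rich for minus-strand reads).
    """
    return _enriched(clip_3p, "A") or _enriched(clip_5p, "T")


# A read would be discarded when has_polya_tail(...) returns True.
print(has_polya_tail("", "AAAAAAAAAAG"))  # 10 of 11 bases are A
```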

HBB targeted nascent RNA sequencing

Additional parameters were added to the above criteria for removing polyA-containing reads from targeted data mapped to the HBB locus based on empirical observation.
Since the HBB locus is integrated randomly in the MEL genome, long uncleaved transcripts that have coverage past the annotated HBB locus read into random genomic regions and cause long stretches of mismatched soft-clipped bases.
A custom script was used to filter polyA-containing reads but retain uncleaved transcripts (available on Github).
Uncleaved reads with long stretches of soft-clipped bases that passed this filtering were then recoded to contain a match in the CIGAR string downstream of the PAS in order to include these regions of the long-reads in coverage calculations.
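One way to picture the recoding step is as an edit to the CIGAR string: a trailing soft-clip (`S`) operation is rewritten as a match (`M`), so that coverage tools count those bases. The sketch below is an assumption about how such a recoding could look, not the authors' script. It handles only a trailing soft clip on a plain CIGAR string, and the function name is hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def recode_trailing_softclip(cigar):
    """Rewrite a trailing soft clip as aligned (M) bases.

    If the soft clip directly follows an M operation, the two are merged.
    """
    ops = _CIGAR_OP.findall(cigar)
    if ops and ops[-1][1] == "S":
        clip_len = ops[-1][0]
        if len(ops) >= 2 and ops[-2][1] == "M":
            merged = int(ops[-2][0]) + int(clip_len)
            ops = ops[:-2] + [(str(merged), "M")]
        else:
            ops[-1] = (clip_len, "M")
    return "".join(length + op for length, op in ops)


print(recode_trailing_softclip("100M500N200M350S"))
```

A read on the minus strand would need the leading rather than the trailing soft clip recoded; that case is left out here for brevity.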

PRO-seq Data Preprocessing

Cutadapt was used to trim paired-end reads to 40 nt, removing adapter sequence and low quality 3′ ends, and discarding reads that were shorter than 20 nt (-m20 -q 1).
Trimmed paired-end reads were first mapped to the Drosophila dm3 reference genome using Bowtie, and subsequent uniquely mapped reads to the dm3 genome were used to determine percent spike-in return across all samples.
Paired-end reads that failed to align to the dm3 genome were mapped to the mm10 reference genome.
Due to the “forward/reverse” orientation of Illumina paired-end sequencing, “+” and “-” stranded bedGraph files were switched at the end of the pipeline (Mahat et al., 2016).
Since the spike-in return was comparable between biological replicates within a treatment type, and no comparisons were made between the two treatment conditions, no further normalizations were performed.

PRO-seq and total RNA-seq Data Analysis

A list of active transcripts in MEL cells was first generated using PRO-seq signal within a 300 nt window around annotated TSSs in the GENCODE mm10 vM20 annotation.
Additionally, if two intron annotations shared a 5′SS or 3′SS, the annotation with the most spliced reads was kept.
Violin plots evaluating PRO-seq 3′ end or RNA-seq read coverage were generated by summing the signal at the indicated positions with respect to the 5′SS, 3′SS or PAS.
P-values were calculated using either the Mann-Whitney or the Wilcoxon matched-pairs signed rank test. Resulting reads were filtered to discard reads with an “N” size > 10,000 using pysam to remove poorly mapped reads or reads mapped across very large introns.
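The “N” size filter in the last sentence above refers to the CIGAR `N` operation, which marks a region skipped from the reference (typically an intron). The authors used pysam; the sketch below instead parses a plain CIGAR string so that it runs without extra dependencies. The function names are hypothetical, and only the 10,000 nt threshold comes from the text.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def max_skip_length(cigar):
    """Length of the largest N (skipped region) operation, or 0 if none."""
    return max(
        (int(length) for length, op in _CIGAR_OP.findall(cigar) if op == "N"),
        default=0,
    )


def passes_n_size_filter(cigar, max_n=10_000):
    """Keep a read unless any skipped region is longer than `max_n`."""
    return max_skip_length(cigar) <= max_n


print(passes_n_size_filter("100M9000N100M"))
```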

Splicing Status Classification and Co-transcriptional Splicing Efficiency (CoSE) Calculation

The annotation of introns contained in active transcripts (described above for PRO-seq), was first filtered for unique intron start and end coordinates.
If the junction was not present in the read, a 10 nt window was included in the search for the junction to allow for slight mismatches in alignments.
If the junction was not found, the intron was classified as unspliced.
To classify splicing status of each read, the number of spliced introns was compared to the total number of introns that was overlapped.
Introns with identical 5′SS or 3′SS were filtered to keep only the intron with the most total reads.
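The per-read classification described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that introns and read junctions are given as (start, end) genomic coordinates, and it interprets the 10 nt window as a tolerance of up to 10 nt at each junction boundary, which is an assumption. The category labels follow the "all spliced", "partially spliced" and "all unspliced" wording used elsewhere in this summary.

```python
def junction_matches(read_junction, intron, window=10):
    """True if both junction boundaries fall within `window` nt of the intron's."""
    return (abs(read_junction[0] - intron[0]) <= window
            and abs(read_junction[1] - intron[1]) <= window)


def splicing_status(read_junctions, overlapped_introns, window=10):
    """Classify one long read by comparing spliced introns to overlapped introns."""
    if not overlapped_introns:
        return "no_introns"
    n_spliced = sum(
        any(junction_matches(j, intron, window) for j in read_junctions)
        for intron in overlapped_introns
    )
    if n_spliced == len(overlapped_introns):
        return "all_spliced"
    if n_spliced == 0:
        return "all_unspliced"
    return "partially_spliced"


introns = [(100, 200), (300, 400)]
print(splicing_status([(100, 200)], introns))
```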

Distance from Splice Junction to 3′ End Calculation

Splicing intermediates (defined below), were filtered out from the long-read data in this analysis, since their 3′ ends do not represent the position of Pol II, but rather an upstream exon between step I and step II of splicing.
For all remaining reads, data were filtered for reads that contained at least 1 splice junction, and then the last “block size”, which represents the distance from the most distal splice junction to the 3′ end of the read, was calculated.
Coordinates of the last spliced intron were also recorded, and each intron was matched to a transcript and categorized by gene biotype using mygene in python.
To determine if certain genes exhibited a longer or shorter distance from 3′ end to Pol II, the distance was split into three equal size categories and transcript IDs from each category were entered into the online PANTHER classification system: no significant enrichment was obtained.
Introns considered in this analysis were the same set of introns considered for CoSE as described above.
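The "last block size" calculation can be illustrated with a small sketch. It assumes each read is represented by its aligned blocks as (start, end) genomic intervals in ascending order, as in a BED12 record, together with its strand. The strand handling is an assumption: on the minus strand the 3′ end of the RNA is the leftmost block. The function name is hypothetical.

```python
def distance_junction_to_3p_end(blocks, strand):
    """Distance from the most distal splice junction to the read's 3' end.

    blocks: list of (start, end) aligned blocks in ascending genomic order.
    Returns None for reads without any splice junction.
    """
    if len(blocks) < 2:
        return None
    # The block carrying the RNA 3' end is the last one in transcript order.
    start, end = blocks[-1] if strand == "+" else blocks[0]
    return end - start


print(distance_junction_to_3p_end([(1000, 1200), (1500, 1650)], "+"))
```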

Long-read Coverage

Transcript coordinates associated with active TSSs (as described above) were obtained from UCSC.
Transcripts were then grouped by the parent Gene ID, and the largest range of start and end coordinates from the grouped transcripts was kept.
Library depth was then calculated using bedtools coverage across this file of collapsed active gene coordinates.
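The collapsing of transcripts to one coordinate range per gene can be sketched as below. This is a minimal illustration that assumes transcripts are provided as (gene_id, start, end) tuples; the function name and the example identifiers are hypothetical.

```python
def collapse_by_gene(transcripts):
    """Keep the widest start/end range among transcripts sharing a gene ID."""
    genes = {}
    for gene_id, start, end in transcripts:
        if gene_id in genes:
            prev_start, prev_end = genes[gene_id]
            genes[gene_id] = (min(prev_start, start), max(prev_end, end))
        else:
            genes[gene_id] = (start, end)
    return genes


transcripts = [
    ("geneA", 1000, 5000),
    ("geneA", 800, 4200),
    ("geneB", 9000, 9500),
]
print(collapse_by_gene(transcripts))
```

The resulting intervals could then be written out as a BED file and passed to a coverage tool, as the text describes.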
For coverage downstream of the PAS, long-reads were separated by splicing status (see below), then coverage was calculated using bedtools within a window around PASs that corresponded to active TSSs or specifically to a window around the HBB PAS.
Coverage at all positions was normalized to the coverage at the position 100 nt upstream of the PAS.
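The normalization step can be sketched as follows, assuming coverage is held as a list of per-nucleotide values across the window and that the index of the PAS within that window is known. The function and parameter names are hypothetical; only the 100 nt upstream reference position comes from the text.

```python
def normalize_to_upstream(coverage, pas_index, offset=100):
    """Divide every position by the coverage `offset` nt upstream of the PAS."""
    ref_index = pas_index - offset
    if ref_index < 0:
        raise ValueError("window does not extend far enough upstream of the PAS")
    reference = coverage[ref_index]
    if reference == 0:
        raise ValueError("no coverage at the reference position")
    return [value / reference for value in coverage]


# Toy window: 150 nt upstream of the PAS at depth 50, 100 nt downstream at depth 10.
coverage = [50.0] * 150 + [10.0] * 100
normalized = normalize_to_upstream(coverage, pas_index=150)
print(normalized[200])
```

With this scaling, a value near 1 downstream of the PAS would indicate that most reads continue past the cleavage site, and a value near 0 that most do not.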

Uncleaved Transcripts Analysis

Bedtools intersect was used to identify long-reads with 5′ ends originating in a gene body of active transcripts (as described above).
Reads were then categorized as being uncleaved transcripts if their 3′ ends were greater than 50 nt downstream of the PAS of the gene which the 5′ end overlapped with.
Splicing status classification of uncleaved transcripts was carried out as described above.
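The uncleaved-transcript criterion can be written as a short strand-aware check. This sketch assumes single genomic coordinates for the read 3′ end and the PAS; the function name is hypothetical, and the 50 nt threshold is the one stated above.

```python
def is_uncleaved(read_3p_end, pas, strand, min_distance=50):
    """True if the read's 3' end lies more than `min_distance` nt past the PAS."""
    if strand == "+":
        downstream = read_3p_end - pas
    else:
        # On the minus strand, downstream means a smaller genomic coordinate.
        downstream = pas - read_3p_end
    return downstream > min_distance


print(is_uncleaved(10_120, 10_000, "+"))
```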
For long-reads derived from MEL-HBB IVS-110(G>A) cells, only reads that were spliced at intron 1 using the cryptic splice site were analyzed, and the rare reads with a splice junction using the canonical splice site were discarded.
Splicing status classification, counting of splicing intermediates, and calculating coverage downstream of the PAS were performed as described above but with the custom HBB annotation coordinates.

QUANTIFICATION AND STATISTICAL ANALYSIS

All information about statistical testing for individual experiments can be found in figure legends, including statistical tests used, number of replicates, and number of observations.
Sample | Sequencing Protocol | Raw read number | Mapped read number | PolyA-filtered read number
MEL_LRS_uninduced | PacBio LRS | 583,632 | 545,477 | 538,452
MEL_LRS_induced | PacBio LRS
RT primer for targeted first strand synthesis barcode 1 AAGCAGTGGTATCAACGCAGAGTACCACATATCAGAGTGCGGAT RT-PCR primer F C1qbp GACGTGTGCTCTTCCGATCTCACAGATTCCCTGGACTGG.

KEY RESOURCES TABLE

Antibodies
Rabbit polyclonal anti-GAPDH | Santa Cruz Biotechnology | FL-335/sc-25778
Mouse monoclonal anti-Pol II | Santa Cruz Biotechnology | CTD4H8/sc-47701
Bacterial and Virus Strains
Biological Samples
Chemicals, Peptides, and Recombinant Proteins
DMEM + GlutaMAX | Gibco | 10569-010
Fetal Bovine Serum (FBS) | Gibco | 16000-044
Penicillin Streptomycin | Gibco | 15140-122
α-Amanitin | Sigma | A2263
SUPERase.
This paper N/A Recombinant DNA Software and Algorithms Porechop v0.2.4 N/A.


Figures (5)

Figure 2. Individual mammalian nascent RNA sequences reveal coordination of co-transcriptional splicing. (A) LRS data visualization for analysis of co-transcriptional splicing. Gene diagram is shown at the top, with the black arrow indicating the TSS. Reads are aligned to the genome and ordered by 3′ end position. Color code indicates the splicing status of each transcript. Each horizontal row represents one read. Panels at far right and below: regions of missing sequence (e.g. spliced introns) are transparent. Light

Figure 7. Efficient splicing promotes 3′ end cleavage (A) Top: schematic describing two engineered MEL cell lines. MEL-HBB WT contains an integrated copy of a wild type human globin minigene. In MEL-HBB IVS-110(G>A), a single point mutation (red triangle) mimics a disease-causing thalassemia allele. Bottom: Sanger sequencing of the HBB minigene coding strand shows that a G>A mutation leads to a cryptic 3′SS at the AG dinucleotide 19 nt upstream of the canonical 3′SS. (B) Distribution of HBB long-reads in MEL-HBB WT cells (purple) and MEL-HBB IVS-110(G>A) cells (orange) separated by splicing status of intron 1 and intron 2 and measured as a fraction of total reads mapped to the HBB gene (n = 20,395 reads in MEL-HBB WT cells, and n = 26,244 reads in MEL-HBB IVS-110(G>A) cells). (C) Fraction of splicing intermediates at intron 1 and intron 2 in MEL-HBB WT cells (purple) and MEL-HBB IVS-110(G>A) cells (orange) measured as a fraction of total reads mapped to the HBB gene. For (B-C), significance tested by Mann Whitney U-test; *** represents p-value < 0.001, bar height represents the mean of three biological replicates, and error bars represent standard error of the mean. (D) Read coverage in the region downstream of the HBB PAS is shown for long-reads separated by their splicing status from MEL-HBB WT cells (purple) and MEL-HBB IVS-110(G>A) cells (orange). Coverage is

Figure 1. Long-read sequencing of nascent RNA from differentiating mouse erythroblasts (A) Schematic of nascent RNA isolation and sequencing library generation. MEL cells are treated with 2% DMSO to induce erythroid differentiation, cells are fractionated to purify chromatin, and chromatin-associated nascent RNA is depleted of polyadenylated and ribosomal RNAs. An adapter is ligated to the 3′ ends of remaining RNAs, then a strand-switching reverse transcriptase is used to create double-stranded cDNA that is the input for PacBio library preparation. (B) Read length and (C) read depth distribution of PacBio long-reads. See also Figures S1 and S2, and Table S1.

Figure 4. Pol II does not pause at 5′ or 3′ splice sites. (A) PRO-seq 3′ end coverage is shown aligned to active transcription start sites (TSS), 5′ splice sites (5′SS), and 3′ splice sites (3′SS). (B) Top: Schematic illustrating the use of color-coded intervals to quantify PRO-seq reads around each 5′SS and 3′SS to test for significance of pausing. Bottom: PRO-seq read density summed in each of the intervals indicated above around 5′SSs (left) and 3′SSs (right) from introns with at least 10 reads in uninduced conditions (n = 3,505). Significance tested by paired t-test; *** represents p-value < 0.001, ns represents p-value > 0.05. (C) Genome browser view showing spliced PRO-seq reads aligned to the Apbb1 gene, where 3′ ends of reads represent the position of elongating Pol II. Only spliced reads, filtered from all reads, are shown. See also Figure S5.

Figure 5. Splicing intermediates are abundant at introns with weak 3′ splice sites (A) Schematic definition of first step splicing intermediates (dotted red oval), which have undergone the first step of splicing and have a free 3′-OH that can be ligated to the 3′ end DNA adapter. Splicing intermediate reads are characterized by a 3′ end at the last nucleotide of the upstream exon. (B) Coverage of long-read 3′ ends (top panels) and 5′ ends (bottom panels) aligned to 5′SSs (left) and 3′SSs (right) of introns. (C) Coverage of long-read 3′ ends across four example genes. Arrows indicate the positions where the most abundant splicing intermediates are observed. (D) Individual long-reads are shown for the gene Alas2. Diagram is similar to Figure 2, but individual reads are colored depending on whether they are splicing intermediates (purple) or not (gray). Data for uninduced and induced cells are shown combined. Potential recursive splicing site is indicated by an arrow and dotted line; recursively spliced reads are shown in detail in (E). (F) MaxEnt splice site scores for 5′SS (left) and 3′SS (right) for


Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis

Kirsten A. Reimer¹, Claudia Mimoso², Karen Adelman², and Karla M. Neugebauer¹*

¹Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, 06520, USA

²Department of Biological Chemistry and Molecular Pharmacology, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA

*Correspondence: karla.neugebauer@yale.edu

ABSTRACT

Pre-mRNA processing steps are tightly coordinated with transcription in many organisms. To determine how co-transcriptional splicing is integrated with transcription elongation and 3′ end formation in mammalian cells, we performed long-read sequencing of individual nascent RNAs and PRO-seq during mouse erythropoiesis. Splicing was not accompanied by transcriptional pausing and was detected when RNA polymerase II (Pol II) was within 75 – 300 nucleotides of 3′ splice sites (3′SSs), often during transcription of the downstream exon. Interestingly, several hundred introns displayed abundant splicing intermediates, suggesting that splicing delays can take place between the two catalytic steps. Overall, splicing efficiencies were correlated among introns within the same transcript, and intron retention was associated with inefficient 3′ end cleavage. Remarkably, a thalassemia patient-derived mutation introducing a cryptic 3′SS improves both splicing and 3′ end cleavage of individual β-globin transcripts, demonstrating functional coupling between the two co-transcriptional processes as a determinant of productive gene output.

Keywords: nascent RNA, erythropoiesis, globin, co-transcriptional splicing, PacBio, long read sequencing

bioRxiv preprint doi: https://doi.org/10.1101/2020.02.11.944595; this version posted December 14, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

INTRODUCTION

Transcription and pre-mRNA processing steps – 5′ end capping, splicing, base modification, and 3′ end cleavage – required for eukaryotic gene expression are each carried out by macromolecular machines. The spliceosome assembles de novo on each intron, recognizing the 5′ and 3′ splice sites (SSs) that demarcate intron boundaries and then catalyzing two transesterification reactions to excise introns and ligate exons together (Wilkinson et al., 2020). In mammalian cells, genes typically encode pre-mRNAs containing 8-10 introns of variable lengths, creating a high cellular demand for spliceosomes relative to all of the other machineries, which only act once per transcript. Splicing is also a highly-regulated process; it is influenced by environmental factors, developmental cues, and factors in the local pre-messenger RNA (pre-mRNA) environment, such as RNA secondary structure and RNA-binding protein occupancy (Baralle and Giudice, 2017; Jeong, 2017; Lin et al., 2016; Pai and Luca, 2019). The influence of trans-acting factors on the selection of 5′ and 3′SSs is thought to explain how constitutive and alternative splice sites are chosen. These working models still largely rely on in vitro biochemistry and often do not explain changes in alternative splicing or overall gene expression observed upon experimental perturbation or disease-associated mutations of splicing factors (Joshi et al., 2017; Manning and Cooper, 2017). Thus, despite detailed knowledge of modulatory factors, the mechanisms underlying the gene regulatory potential of pre-mRNA splicing are not fully understood in vivo.

Across species, tissues, and cell types, splicing occurs during pre-mRNA synthesis by Pol II (Custodio and Carmo-Fonseca, 2016; Neugebauer, 2019). Thus, spliceosome assembly occurs as the nascent RNA is growing longer and more diverse in sequence and structure. Spliceosomes may not assemble on all of the introns at the same time, because promoter-proximal introns are synthesized before promoter-distal introns. The questions of whether introns are spliced in the order they are transcribed and how splicing of individual introns within a given transcript might be coordinated are currently the subject of intense investigation. Co-transcriptional splicing also demands that the constellation of splicing factors capable of regulating a splicing event bind the nascent RNA coordinately with the timing imposed by transcription and in a relevant spatial window. For example, a splicing inhibitor element in a given nascent RNA would only be influential if it were transcribed before the target intron was removed.

Recently, the Neugebauer lab has used single-molecule sequencing approaches to determine how splicing progresses as a function of transcription in budding and fission yeasts, where introns are removed shortly after synthesis (Alpert et al., 2020; Carrillo Oesterreich et al., 2016; Herzel et al., 2018). The approaches mark the nascent RNA’s 3′ end, which is present in the catalytic center of Pol II, to determine the position of Pol II when splicing occurs and define the sequence of the pre-mRNA substrate acted on by the spliceosome. These data show that only a small portion of the downstream exon may be needed for 3′SS identification and splicing. Interestingly, altering the rate of Pol II elongation affects splicing outcomes, including widespread changes in alternative splicing (Aslanzadeh et al., 2018; Braberg et al., 2013; Carrillo Oesterreich et al., 2016; de la Mata et al., 2003; Fong et al., 2014; Ip et al., 2011; Jonkers and Lis, 2015; Schor et al., 2013). Taken together, these findings suggest that transcription elongation rate may govern the amount of downstream RNA available for cis regulation at the time that splicing takes place. This in turn would determine which trans-acting regulatory factors could be recruited to the nascent RNA to modulate splicing. To obtain mechanistic insights into these processes, we need to understand how mammalian cells – with many more introns per gene and vastly increased levels of alternative splicing compared to yeast – coordinate co-transcriptional splicing with transcription elongation.

Another issue raised by co-transcriptional RNA processing is how splicing is coordinated with other pre-mRNA processing steps (Bentley, 2014; Herzel et al., 2017). In recent long-read sequencing studies in budding and fission yeasts (Alpert et al., 2020; Herzel et al., 2018), “all or none” splicing of individual nascent transcripts was discovered, suggesting positive and negative cooperativity among neighboring introns and polyA cleavage sites. Indeed, crosstalk among introns was observed in human cells at the same time by others (Kim et al., 2017; Tilgner et al., 2018). However, those studies did not explore coupling to 3′ end formation. Cleavage of the nascent RNA by the cleavage and polyadenylation machinery at polyA sites (PAS) releases the RNA from Pol II and the RNA is subsequently polyadenylated (Kumar et al., 2019). Coupling between splicing and 3′ end cleavage is important, because uncleaved transcripts are degraded by the nuclear exosome in S. pombe (Herzel et al., 2018; Meola et al., 2016; Zhou et al., 2015). Whether 3′ end cleavage efficiency contributes to gene expression levels in mammalian cells is currently unknown.

Here we report our analysis of nascent RNA transcription and splicing in murine erythroleukemia (MEL)

cells undergoing erythroid differentiation, a developmental program that exhibits well-known, drastic

changes in gene expression (An et al., 2014; Reimer and Neugebauer, 2018). We have employed two

single-molecule sequencing approaches to directly measure co-transcriptional splicing of nascent RNA:

(i) Long-read sequencing (LRS), which enables genome-wide analysis of splicing with respect to Pol II

position and (ii) Precision Run-On sequencing (PRO-seq), enabling the assessment of Pol II density at

these sites. We rigorously determine the spatial window in which co-transcriptional splicing occurs and

define co-transcriptional splicing efficiency for thousands of mouse introns, Pol II elongation behavior

across splice junctions, and the effects of efficient co-transcriptional splicing on 3′ end cleavage. These

findings identify the pre-mRNA substrates of splicing and show that splicing of multiple introns within

individual transcripts is coordinated with 3′ end cleavage. In particular, the demonstration of highly

efficient splicing in the absence of transcriptional pausing causes us to rethink key features of splicing

regulation in mammalian cells.

RESULTS

PacBio Long-read Sequencing of Nascent RNA Yields High Read Coverage

Murine erythroleukemia (MEL) cells are immortalized at the proerythroblast stage and can be induced to

enter terminal erythroid differentiation by treatment with 2% DMSO for five days (Antoniou, 1991).

Phenotypic changes include decreased cell volume, increased levels of β-globin, and visible

hemoglobinization (Figures S1A-C). We used chromatin purification of uninduced and induced MEL cells

to enrich for nascent RNA (Figure 1A). Chromatin purification under stringent washing conditions allows

release of contaminating RNAs and retains the stable ternary complex formed by elongating Pol II, DNA,

and nascent RNA (Figure S1D; Wuarin and Schibler, 1994). Importantly, spliceosome assembly does

not continue during chromatin fractionation or RNA isolation, because the presence of the splicing

inhibitor Pladienolide B throughout the purification process does not change splicing levels (Figure S2).

To generate libraries for LRS, we established the protocol outlined in Figure 1A. Two biological

replicates, each with two technical replicates, were sequenced using PacBio RSII and Sequel flow cells,

yielding a total of 1,155,629 mappable reads (Table S1). Reads containing a non-templated polyA tail

comprised only 1.7% of the total reads (Table S1) and were removed bioinformatically along with

abundant 7SK RNA reads. Of the remaining reads, the average read lengths were 710 and 733 nucleotides

(nt), and the average coverage was 8.4 and 4.8 reads per gene for uninduced and induced samples,

respectively (Figure 1B-C). More than 7,500 genes were represented by more than 10 reads per gene

in each condition (Figure 1C). Coverage of 5′ ends was focused at annotated transcription start sites

(TSSs), with 18.3% of 5′ ends within 50 bp of an active TSS across all samples. As expected, 3′ end

coverage was distributed more evenly throughout gene bodies, with an increase just upstream of

annotated transcription end sites (TESs) and a drop after TESs (Figure S1E).

LRS Reveals Rapid and Efficient Co-transcriptional Splicing

Each long-read provides two critical pieces of information: the 3′ end reveals the position of Pol II when

the RNA was isolated; the splice junctions reveal if splicing has occurred and which splice sites were

chosen. Here, we present our LRS data in a format that highlights 3′ end position and the associated

splicing status (Figures 2A and 2B; Figure S3A). Each transcript was categorized and colored according to

its splicing status, which can be either “all spliced”, “partially spliced”, “all unspliced”, or “NA” (transcripts

that did not span an entire intron or a 3′SS). For each gene, we calculated the fraction of long-reads that

were all spliced, partially spliced, or all unspliced (Figure 2A; bar plot far right), enabling a survey of

splicing behaviors within individual transcripts (Alpert et al., 2020; Herzel et al., 2018; Kim et al., 2017).

Splicing status of individual transcripts varied from gene to gene. For example, the gene Actb had mostly

all spliced reads (78% and 75% of reads in uninduced and induced cells respectively), while Calr and

Eif1 had a greater fraction of all unspliced reads (Figure 2B). Genome-wide, the majority of long-reads

were all spliced (Figure 2C; 68.0% and 73.8% for uninduced and induced cells, respectively), with an

average of 88% of all introns being spliced. Therefore, the majority of introns are removed co-

transcriptionally. To validate this finding, we examined the read length distribution for reads of each

splicing status (Figure S3B). As expected, partially spliced and all unspliced reads were longer than all

spliced reads due to the presence of introns, suggesting that the efficient shortening of nascent RNA due

to splicing limits the lengths of long-reads.
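
The read-level classification described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the coordinate convention, and the simplification that "NA" means "the read fully spans no annotated intron" (the 3′SS case is ignored) are all assumptions made for illustration.

```python
from collections import Counter

def classify_read(read_start, read_end, introns, junctions):
    """Assign a splicing status to one long read (hypothetical helper).

    introns:   list of (start, end) annotated intron coordinates for the gene.
    junctions: set of (start, end) introns that are spliced out in this read.
    Assumption: an intron counts as "spanned" only if the read covers it entirely.
    """
    spanned = [iv for iv in introns if read_start <= iv[0] and iv[1] <= read_end]
    if not spanned:
        return "NA"
    n_spliced = sum(1 for iv in spanned if iv in junctions)
    if n_spliced == len(spanned):
        return "all spliced"
    if n_spliced == 0:
        return "all unspliced"
    return "partially spliced"

def status_fractions(statuses):
    """Fraction of classified (non-NA) reads in each splicing status."""
    counts = Counter(s for s in statuses if s != "NA")
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()} if total else {}
```

Applying `status_fractions` to all reads of one gene gives per-gene fractions analogous to those summarized in the bar plots of Figure 2A.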

To quantify co-transcriptional splicing for each intron detected by at least 10 long-reads, we defined a

metric termed the Co-transcriptional Splicing Efficiency (CoSE), tabulated as the number of spliced reads

that span the intron divided by the total number of reads (spliced + unspliced) that span the intron (Figure

2D). A higher CoSE value indicates a higher fraction of co-transcriptional splicing. To validate this metric,

we analyzed an independently generated total RNA-seq dataset in uninduced MEL cells (downloaded

from ENCODE; Davis et al., 2018). Although nascent RNA is rare in total RNA, the density of reads

mapping to a given intron is expected to be inversely proportional to splicing efficiency. The ratio of intron-

mapping reads relative to the flanking exon-mapping reads was calculated for each intron and compared

to CoSE levels. As expected, higher CoSE corresponded to lower relative intron coverage in the total

RNA-seq data (Figure S3C). Thus, this independent dataset validates the CoSE metric. CoSE values

also remained stable across all levels of read coverage (Figure S3D).
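
As a worked illustration of the CoSE definition given above, the following Python sketch computes the metric for a single intron from its spliced and unspliced read counts. The function name and the choice to return `None` below the 10-read threshold are assumptions made for this example, not the authors' code.

```python
def cose(spliced_reads, unspliced_reads, min_reads=10):
    """Co-transcriptional Splicing Efficiency for one intron.

    CoSE = spliced / (spliced + unspliced), counting only reads that span the intron.
    Introns covered by fewer than `min_reads` spanning reads return None
    (hypothetical convention; the text only states that such introns were not quantified).
    """
    total = spliced_reads + unspliced_reads
    if total < min_reads:
        return None
    return spliced_reads / total
```

For example, an intron spanned by 9 spliced and 1 unspliced reads has a CoSE of 0.9, whereas an intron spanned by only 5 reads is not assigned a value.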

To determine if intron splicing events are coordinated within the same transcript, we asked how similar

CoSE values were between introns in the same transcript. To do so, transcripts containing at least 3

introns with recorded CoSE values (n = 2,028) were compiled. We found that the variance in CoSE

between introns within the same transcript was significantly smaller than the variance in CoSE for the

same number of randomly assorted introns (Figure 2E); these differences persisted when we analyzed

transcripts containing 3, 4, or 5 introns supported by long-reads (Figure S3E). Taken together, these

results suggest that most introns are well-spliced co-transcriptionally, and that splicing is coordinated in

mammalian multi-intron transcripts expressed by both uninduced and induced MEL cells.
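
The comparison between within-transcript CoSE variance and the variance of randomly assorted introns can be sketched as follows. This is one plausible way to build such a control, assuming that introns are shuffled across all qualifying transcripts while group sizes are preserved. The use of population variance, the fixed seed, and the function names are assumptions of this example and are not taken from the methods.

```python
import random
import statistics

def within_transcript_variances(transcripts, min_introns=3):
    """Variance of CoSE among introns of each transcript with enough introns.

    transcripts: dict mapping transcript ID -> list of CoSE values (one per intron).
    """
    return [statistics.pvariance(values)
            for values in transcripts.values() if len(values) >= min_introns]

def shuffled_variances(transcripts, min_introns=3, seed=0):
    """Same calculation after randomly reassigning introns to groups of equal size."""
    rng = random.Random(seed)
    kept = [values for values in transcripts.values() if len(values) >= min_introns]
    pool = [x for values in kept for x in values]
    rng.shuffle(pool)
    variances, start = [], 0
    for values in kept:
        group = pool[start:start + len(values)]
        variances.append(statistics.pvariance(group))
        start += len(values)
    return variances
```

If splicing is coordinated within transcripts, the observed variances should tend to be smaller than the shuffled ones; the two distributions could then be compared with a suitable statistical test.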

The frequency of all-spliced nascent transcripts implies that splicing in mammalian cells is rapid enough

to match the rate of transcription. A direct way to address this is to measure the position of Pol II on

nascent RNA when ligated exons are observed. Observing Pol II downstream of a spliced junction

indicates that the active spliceosome has assembled and catalyzed splicing in the time it took for Pol II

to translocate the measured distance. Therefore, we determined the distance in nucleotides between the

3′ end of each read and the nearest spliced exon-exon junction (Figure 3A). To eliminate 3′ ends that

arise from splicing intermediates and not from active transcription, reads with 3′ ends mapping precisely

to the last nt of exons were removed from this analysis. Although the longest distances between splice

junctions and elongating Pol II were just over 6 kb, these were rare. Instead, 75% of splice junctions were

within ~300 nt of a 3′ end, and the median distance was 154 nt in uninduced cells and 128 nt in induced

cells (Figure 3B). Therefore, changes in the gene expression program during erythropoiesis did not alter

the dynamic relationship between transcription and splicing. Consistent with this, CoSE values were

similar when comparing induced to uninduced cells (Figure 3C; Spearman’s rho = 0.56). In fact, only 66

introns with improved splicing and 42 introns with reduced splicing displayed a >2-fold change in CoSE

upon induction. Taken together, this analysis shows that although global changes in gene expression

take place between these two timepoints, the relationship between transcription and splicing remains the

same. Overall, these two measurements do not support major changes in splicing efficiency during

erythroid differentiation. Moreover, the distance from Pol II to the nearest splice junction was independent

of GO category or intron length (Figure S4B; GO analysis not shown). Because median exon size in the

mouse genome is 151 nt (Waterston et al., 2002), our data indicate that active spliceosomes can be fully

assembled and functional when Pol II is within or just downstream of the next transcribed exon. Recent

direct sequencing of nascent RNA seemed to reveal less rapid splicing (Drexler et al., 2020). However,

when we analyzed this dataset in the same manner as our own, the cumulative distance from Pol II to

the nearest splice junction is similarly close across organisms and cell types (median distance in human

Frequently Asked Questions (2)

Q1. What are the contributions in "Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis" ?

To determine how co-transcriptional splicing is integrated with transcription elongation and 3′ end formation in mammalian cells, the authors performed long-read sequencing of individual nascent RNAs and PRO-seq during mouse erythropoiesis. Interestingly, several hundred introns displayed abundant splicing intermediates, suggesting that splicing delays can take place between the two catalytic steps.

Q2. What are the future works mentioned in the paper "Co-transcriptional splicing regulates 3′ end cleavage during mammalian erythropoiesis" ?

Future studies of these enigmatic new players may reveal a role for 3′SS diversity in the regulation of splicing by stalling between catalytic steps. Investigation of these mechanisms awaits future studies that would afford single-transcript evaluation of the residence time of intron-bound inhibitory factors (e.g., U1 snRNP) coupled with splicing and cleavage outcome. Less efficient splicing can inhibit 3′ end cleavage (Cooke et al., 1999; Davidson and West, 2013; Martins et al., 2011), suggesting that introns retained in transcripts that display readthrough harbor an inhibitory activity that represses 3′ end cleavage (Figure 7E). The authors speculate that this inhibitory activity persists longer on inefficiently spliced transcripts, potentially binding and inactivating 3′ end cleavage factors (Deng et al., 2020; So et al., 2019).
