Normalization of Single Cell RNA Sequencing Data Using both Control and Target Genes
TL;DR: This work develops an alternative statistical method, scPLS, which is based on partial least squares and models control and target genes jointly to infer and control for confounding effects more accurately.
Abstract: The single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is thus a crucial step for proper data normalization and accurate downstream analysis. Several recent methodological studies have demonstrated the use of control genes for controlling for confounding effects in scRNAseq studies; the control genes are used to infer the confounding effects, which are then used to normalize target genes of primary interest. However, these methods can be suboptimal as they ignore the rich information contained in the target genes. Here, we develop an alternative statistical method, which we refer to as scPLS, for more accurate inference of confounding effects. Our method is based on partial least squares and models control and target genes jointly to better infer and control for confounding effects. To accompany our method, we develop a novel expectation maximization algorithm for scalable inference. Our algorithm is an order of magnitude faster than standard ones, making scPLS applicable to hundreds of cells and hundreds of thousands of genes. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. We apply scPLS to analyze three scRNAseq data sets to further illustrate its benefits in removing technical confounding effects as well as cell cycle effects.
Summary (2 min read)
- The single cell RNA sequencing technique is becoming increasingly popular for unbiased and high-resolution transcriptome analysis of heterogeneous cell populations.
- Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.
- These hidden confounding factors can cause systematic bias, are notoriously difficult to control for, and are the focus of the present study.
- In the Simulations section the authors present comparisons between scPLS and several existing methods using simulations.
Review of Previous Methods
- Many statistical methods have been developed in sequencing- and array-based genomic studies to infer hidden confounding factors and control for hidden confounding effects.
- The application-specific methods become inconvenient in cases where there are multiple variables of interest (e.g. in eQTL mapping problems).
- In contrast, the second subcategory of unsupervised methods has recently been developed to take advantage of a set of control genes for inferring the confounding factors [29,37].
- Similarly, most scRNAseq studies include a set of control genes that are known to have varying expression levels across cell cycles.
- The two subcategories of unsupervised methods use different strategies to infer the confounding factors.
- The authors provide modeling details for scPLS here.
- Consistent with the clustering performance comparison, the authors found that scPLS also yielded more accurate estimates of the proportion of variance explained (Fig. 2b).
- In the target genes, the confounding factors and structured biological factors explain a median of 18% and 30% of gene expression variance, respectively.
- The results are shown in Table 1 and are overall consistent with the simulations.
- To demonstrate its effectiveness there, the authors applied scPLS and several other methods to a second dataset that was used for demonstrating cell cycle influence [37].
- The authors have presented scPLS for removing hidden confounding effects in scRNAseq studies.
- Importantly, the performance of scPLS is robust to the number of genes included in the control set and yields comparable results even when a much smaller number of control genes is used.
- In fact, low-rank factors inferred from many data sets using standard factor models have been linked to important biological pathways or transcription factors [42–46].
- The authors have mainly focused on comparing methods for removing confounding effects by evaluating clustering performance as the target downstream analysis.
- Like many other methods for scRNAseq [21] or bulk RNAseq [58,59] studies, scPLS requires a data transformation step that converts the count data into quantitative expression data.
- The authors list the EM algorithm below, with detailed derivation provided later.
- The naive EM algorithm is computationally expensive: it scales quadratically with the number of genes and linearly with the number of cells/samples.
- Therefore, the authors apply the EM-in-chunks algorithm with chunk size 500 throughout the rest of the paper.
- In the E step, the authors calculate the expectation of the log likelihood function for complete data.
- Ten replicates were performed for each setting on an Intel Xeon E5-2670 2.6 GHz CPU.
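The scaling statements above (naive EM quadratic in the number of genes and linear in the number of cells; chunks of 500 genes) can be made concrete with a little arithmetic. The sketch below is only a cost model, not the authors' algorithm: it assumes each chunk is processed independently with the same quadratic cost as the naive algorithm, ignores constant factors and the control genes, and the function names `naive_cost` and `chunked_cost` are hypothetical.

```python
def naive_cost(n_cells: int, n_genes: int) -> int:
    """Relative cost of one naive EM pass: linear in cells, quadratic in genes."""
    return n_cells * n_genes ** 2


def chunked_cost(n_cells: int, n_genes: int, chunk_size: int = 500) -> int:
    """Relative cost when genes are split into chunks handled one at a time.

    Assumption: each chunk costs the same as a naive pass on that many genes.
    """
    full_chunks, remainder = divmod(n_genes, chunk_size)
    cost = full_chunks * naive_cost(n_cells, chunk_size)
    if remainder:
        # The last, smaller chunk is charged at its own quadratic cost.
        cost += naive_cost(n_cells, remainder)
    return cost


if __name__ == "__main__":
    n_cells, n_genes = 100, 10_000
    ratio = naive_cost(n_cells, n_genes) / chunked_cost(n_cells, n_genes)
    print(f"naive / chunked relative cost: {ratio:.1f}")
```

Under this model the chunked cost is roughly n_cells × n_genes × chunk_size, so it grows linearly rather than quadratically with the number of genes, and the saving factor is about n_genes / chunk_size.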
Frequently Asked Questions (2)
Q1. What have the authors contributed in "Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes"?
Here, the authors present a novel statistical method, which they refer to as scPLS (single cell partial least squares), for robust and accurate inference of confounding effects. scPLS takes advantage of the fact that genes in a scRNAseq study can often be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. With extensive simulations and comparisons with other methods, the authors demonstrate the effectiveness of scPLS. Finally, the authors apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as cell cycle effects.
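The control/target split described above can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not the authors' scPLS implementation: it assumes a linear factor model in which control genes depend only on confounding factors while target genes depend on confounding factors plus structured biological factors, and it removes confounders with the simpler control-genes-only baseline (leading principal components of the control set projected out of the target set). All function names and parameter values are hypothetical.

```python
import numpy as np


def simulate(n_cells=200, n_control=50, n_target=300, k_conf=2, k_bio=2, seed=0):
    """Simulate control genes X and target genes Y (cells in rows)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_cells, k_conf))  # confounding factors
    U = rng.standard_normal((n_cells, k_bio))   # structured biological factors
    X = Z @ rng.standard_normal((k_conf, n_control))
    X += rng.standard_normal((n_cells, n_control))
    Y = Z @ rng.standard_normal((k_conf, n_target))
    Y += U @ rng.standard_normal((k_bio, n_target))
    Y += rng.standard_normal((n_cells, n_target))
    return X, Y, Z, U


def remove_confounders_control_only(X, Y, k_conf):
    """Baseline: estimate confounders from control genes only, project them out of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    left, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Z_hat = left[:, :k_conf]             # orthonormal basis for estimated confounders
    return Yc - Z_hat @ (Z_hat.T @ Yc)   # residual after projection


def variance_explained(Y, F):
    """Per-gene proportion of variance in Y explained by the factors F (R squared)."""
    Yc = Y - Y.mean(axis=0)
    Fc = F - F.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Fc, Yc, rcond=None)
    resid = Yc - Fc @ coef
    return 1.0 - (resid ** 2).sum(axis=0) / (Yc ** 2).sum(axis=0)


if __name__ == "__main__":
    X, Y, Z, U = simulate()
    Y_clean = remove_confounders_control_only(X, Y, k_conf=2)
    print("median variance explained by confounders, before:",
          np.median(variance_explained(Y, Z)))
    print("median variance explained by confounders, after: ",
          np.median(variance_explained(Y_clean, Z)))
```

This baseline uses only the control genes to estimate the confounders; the point made in the abstract is that scPLS additionally draws on the target genes when inferring them, which this sketch does not attempt.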
Q2. What are the future works mentioned in the paper "Controlling for confounding effects in single cell RNA sequencing studies using both control and target genes"?
Exploring the use of biological factors in scPLS is an interesting avenue for future research. It would also be important to evaluate the performance of scPLS in other analysis settings in future studies. One potential disadvantage of scPLS is that it does not model raw count data directly; extending the framework to model count data [65,66] is therefore another promising avenue for future research.
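Because scPLS works on transformed expression values rather than raw counts, a transformation step precedes it. The summary does not say which transformation the authors use, so the sketch below shows one common choice as an assumption: log2 of counts per million with a pseudocount. The function name `log_cpm` and the default pseudocount are hypothetical.

```python
import numpy as np


def log_cpm(counts, pseudocount=1.0):
    """Convert a cells-by-genes count matrix to log2(counts per million + pseudocount)."""
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=1, keepdims=True)  # total counts per cell
    if np.any(library_sizes == 0):
        raise ValueError("every cell must have at least one count")
    cpm = counts / library_sizes * 1e6
    return np.log2(cpm + pseudocount)


if __name__ == "__main__":
    toy_counts = [[10, 0, 90], [5, 5, 0]]
    print(log_cpm(toy_counts))
```

A transformation like this discards the discrete nature of the counts, which is the limitation that modeling count data directly would address.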