Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Q: How can the authors extend the approach used in DESeq2 to isoform specific analysis?

In addition, the approach used in DESeq2 can be extended to isoform-specific analysis, either through generalized linear modeling at the exon level with a gene-specific mean, as in the DEXSeq package [30], or through counting evidence for alternative isoforms in splice graphs [31,32].

Q: What is the way to replace an outlier?

As the outlier is replaced with the value predicted by the null hypothesis of no differential expression, this is a more conservative choice than simply omitting the outlier.

Q: What is the significance of a thresholded test?

Figure 4A demonstrates how such a thresholded test gives rise to a curved decision boundary: to reach significance, the estimated LFC has to exceed the specified threshold by an amount that depends on the available information.

Questions

Q1. What are the contributions mentioned in the paper "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2"?

Q2. What was used to compare a hierarchical clustering with the true cluster membership?

Q3. What is the main reason why the methods that treat each gene separately suffer from lack of power?

Q4. What is the disadvantage of the rlog transformation with respect to the VST?

Q5. What is the way to remove outliers from subsequent analysis?

Q6. Why is the Wald test used in multiple testing?

Q7. What are the use cases of DESeq2?

Q8. What was the expected result of the permutation-based SAMseq method?

Q9. How can the authors extend the approach used in DESeq2 to isoform specific analysis?

Q10. What is the way to replace an outlier?

Q11. What is the significance of a thresholded test?

Q12. What is the difference between the LFC estimates and the mean?

Q13. What is the null hypothesis for DESeq2?

Q14. What is the difference between the rlog transformation and the VST?

Q15. What can be done to improve the results of DESeq2?

Accepted Answer

The authors present DESeq2, a method for differential analysis of count data that uses shrinkage estimation for dispersions and fold changes to improve the stability and interpretability of estimates. An important task here is the analysis of RNA sequencing (RNA-seq) data with the aim of finding genes that are differentially expressed across groups of samples. Many methods for differential expression analysis of RNA-seq data share information across genes for variance (or, equivalently, dispersion) estimation. DESeq2 is a successor to the authors' DESeq method [4]. The authors demonstrate the advantages of DESeq2's new features by describing a number of applications possible with shrunken fold changes and their estimates of standard error, including improved gene ranking and visualization, hypothesis tests above and below a threshold, and the regularized logarithm transformation for quality assessment and clustering of overdispersed count data.

(Affiliation given in the source: Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany. The article is Open Access, distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in the article, unless otherwise stated.)

Note that although the authors refer in the paper to counts of reads in genes, the methods presented can be applied as well to other kinds of HTS count data. For each gene, the authors fit a generalized linear model (GLM) [12] as follows.
The authors model read counts Kij as following a negative binomial distribution (sometimes also called a gamma-Poisson distribution) with mean μij and dispersion αi. The use of linear models provides the flexibility to also analyze more complex designs, as is often useful in genomic studies [15]. The authors explain the concepts of their approach using as examples a dataset by Bottomly et al. [16] with RNA-seq data for mice of two different strains and a dataset by Pickrell et al. [17] with RNA-seq data for human lymphoblastoid cell lines.

Next, the authors determine the location parameter of the distribution of the gene-wise dispersion estimates; to allow for dependence on average expression strength, they fit a smooth curve, as shown by the red line in Figure 1. This provides an accurate estimate for the expected dispersion value for genes of a given expression strength but does not represent deviations of individual genes from this overall trend. Their approach therefore accounts for gene-specific variation to the extent that the data provide this information, while the fitted curve aids estimation and testing in less information-rich settings. In Figure 1, the black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). Their approach is similar to the one used by DSS [6], in that both methods sequentially estimate a prior distribution for the true dispersion values around the fit, and then provide the maximum a posteriori (MAP) as the final estimate. The authors reasoned that in many cases, the reason for extraordinarily high dispersion of a gene is that it does not obey their modeling assumptions; some genes may show much higher variability than others for biological or technical reasons, even though they have the same average expression levels. The authors demonstrate this issue using the dataset by Bottomly et al. [16].
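As a rough illustration of the count model, the negative binomial (gamma-Poisson) distribution with mean μ and dispersion α has variance μ + αμ², so the squared coefficient of variation is 1/μ + α. The Python sketch below uses hypothetical function names and made-up numbers; it is not taken from the DESeq2 package.

```python
import math


def nb_variance(mu: float, alpha: float) -> float:
    """Variance of a negative binomial count with mean mu and dispersion alpha."""
    return mu + alpha * mu ** 2


def nb_squared_cv(mu: float, alpha: float) -> float:
    """Squared coefficient of variation: Poisson part (1/mu) plus dispersion."""
    return nb_variance(mu, alpha) / mu ** 2


if __name__ == "__main__":
    # Illustrative values only: same dispersion, increasing mean.
    for mu in (1.0, 10.0, 100.0, 1000.0):
        cv = math.sqrt(nb_squared_cv(mu, alpha=0.1))
        print(f"mean={mu:7.1f}  variance={nb_variance(mu, 0.1):10.1f}  CV={cv:.3f}")
```

For large means the Poisson term 1/μ becomes negligible and the squared coefficient of variation approaches α, which is why the dispersion matters most for well-expressed genes.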
The authors again employ an empirical Bayes procedure. The stronger curvature of the green posterior at its maximum translates to a smaller reported standard error for the MAP LFC estimate (horizontal error bar). Genes with high information for LFC estimation will have, in their approach, LFCs with both low bias and low variance. Furthermore, as the degrees of freedom increase, and the experiment provides more information for LFC estimation, the shrunken estimates will converge to the unshrunken estimates.

To demonstrate this, the authors split the Bottomly et al. samples equally into two groups, I and II, such that each group contained a balanced split of the strains, simulating a scenario where an experiment (samples in group I) is performed, analyzed and reported, and then independently replicated (samples in group II). This makes shrunken LFCs also suitable for ranking genes, e.g., to prioritize them for follow-up experiments. For example, if the authors sort the genes in the two sample groups of Figure 3 by unshrunken LFC estimates, and consider the 100 genes with the strongest up- or down-regulation in group I, they find only 21 of these again among the top 100 up- or down-regulated genes in group II. However, if the authors rank the genes by shrunken LFC estimates, the overlap improves to 81 of 100 genes (Additional file 1: Figure S3). The authors demonstrate this in the Benchmarks section of the paper.

The loss of power from multiple-testing adjustment can be reduced if genes that have little or no chance of being detected as differentially expressed are omitted from the testing, provided that the criterion for omission is independent of the test statistic under the null hypothesis [22] (see Methods). For well-powered experiments, however, a statistical test against the conventional null hypothesis of zero LFC may report genes with statistically significant changes that are so weak in effect strength that they could be considered irrelevant or distracting.
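The qualitative behaviour of LFC shrinkage (strong shrinkage when information is low, convergence to the unshrunken estimate when information is high) can be illustrated with a deliberately simplified normal-normal model. This is an analogue only, not DESeq2's actual GLM-based MAP computation; `beta_mle`, `se` and `prior_sd` are hypothetical names for an unshrunken estimate, its standard error, and the standard deviation of a zero-centered normal prior.

```python
import math


def shrink_lfc(beta_mle: float, se: float, prior_sd: float) -> tuple[float, float]:
    """Posterior mode and sd for a normal likelihood with a zero-centered normal prior.

    Simplified stand-in for empirical Bayes shrinkage: the weight on the
    data goes to 1 as the standard error goes to 0.
    """
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    posterior_mode = weight * beta_mle
    posterior_sd = se * math.sqrt(weight)
    return posterior_mode, posterior_sd


if __name__ == "__main__":
    # Same unshrunken LFC of 2.0, decreasing uncertainty (illustrative values).
    for se in (2.0, 1.0, 0.5, 0.1):
        mode, sd = shrink_lfc(2.0, se, prior_sd=1.0)
        print(f"se={se:4.1f}  shrunken LFC={mode:.3f}  posterior sd={sd:.3f}")
```

In this toy model the posterior standard deviation is always smaller than the input standard error, mirroring the statement that stronger curvature of the posterior gives a smaller reported standard error.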
Filtering post hoc on a fold-change threshold, however, loses the benefit of an easily interpretable FDR, as the reported P value and adjusted P value still correspond to the test of zero LFC. It is therefore desirable to include the threshold in the statistical testing procedure directly, i.e., not to filter post hoc on a reported fold-change estimate, but rather to evaluate statistically whether there is sufficient evidence that the LFC is above the chosen threshold. The authors note that related approaches to generate gene lists that satisfy both statistical and biological significance criteria have been previously discussed for microarray data [23] and recently for sequencing data [19]. For analyses that instead look for genes with little or no change, DESeq2 offers a test of the composite null hypothesis |βir| ≥ θ, which will report genes as significant for which there is evidence that their LFC is weaker than θ.

As the aim of differential expression analysis is typically to find consistently up- or down-regulated genes, it is useful to consider diagnostics for detecting individual observations that overly influence the LFC estimate and P value for a gene. While the original fitted means are heavily influenced by a single sample with a large count, the corrected LFCs provide a better fit to the majority of the samples.

When the authors consider the variance of each gene, computed across samples, these variances are stabilized (i.e., approximately the same, or homoskedastic) after the rlog transformation, while they would otherwise strongly depend on the mean counts. Note that while the rlog transformation builds upon their LFC shrinkage approach, it is distinct from and not part of the statistical inference procedure.
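A minimal sketch of threshold-aware Wald-type tests follows, assuming a normally distributed estimate `lfc` with standard error `se` (hypothetical names). The first function tests the null |β| ≤ θ, the second uses the standard two one-sided tests construction for the null |β| ≥ θ. The exact formulas used inside DESeq2 are not given in this text, so treat these as assumptions.

```python
import math


def normal_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pvalue_greater_abs(lfc: float, se: float, theta: float) -> float:
    """P value for H0: |beta| <= theta against |beta| > theta."""
    z = (abs(lfc) - theta) / se
    return min(1.0, 2.0 * normal_upper_tail(z))


def pvalue_less_abs(lfc: float, se: float, theta: float) -> float:
    """P value for H0: |beta| >= theta against |beta| < theta (two one-sided tests)."""
    p_above_lower = normal_upper_tail((lfc + theta) / se)  # evidence that beta > -theta
    p_below_upper = normal_upper_tail((theta - lfc) / se)  # evidence that beta < theta
    return max(p_above_lower, p_below_upper)


if __name__ == "__main__":
    # Same estimated LFC, different precision: only the precise one clears theta = 1.
    for se in (1.0, 0.5, 0.25):
        print(f"se={se:.2f}  p(|beta|>1)={pvalue_greater_abs(2.0, se, 1.0):.4f}")
```

With `theta=0`, `pvalue_greater_abs` reduces to the conventional two-sided Wald test. The dependence on `se` is what produces a curved decision boundary: an imprecise estimate must exceed the threshold by a larger margin to reach significance.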
This is in contrast to the variance-stabilizing transformation (VST) for overdispersed counts introduced in DESeq [4]: while the VST is also effective at stabilizing variance, it does not directly take into account differences in size factors; and in datasets with large variation in sequencing depth (dynamic range of size factors ≳ 4) the authors observed undesirable artifacts in the performance of the VST. Both the rlog transformation and the VST are provided in the DESeq2 package. The authors demonstrate the use of the rlog transformation on the RNA-seq dataset of Hammer et al. [26], wherein RNA was sequenced from the dorsal root ganglion of rats that had undergone spinal nerve ligation and controls, at 2 weeks and at 2 months after the ligation.

Furthermore, the number of genes called significantly differentially expressed depends as much on the sample size and other aspects of experimental design as it does on the biology of the experiment, and well-powered experiments often generate an overwhelmingly long list of hits [9]. The authors furthermore compare DESeq2's statistical power with existing tools, revealing that their methodology has high sensitivity and precision, while controlling the false positive rate.

However, it can be advantageous to calculate gene-specific normalization factors sij to account for further sources of technical biases such as differing dependence on GC content, gene length or the like, using published methods [13,14], and these can be supplied instead. The shrinkage procedure thereby helps avoid potential false positives, which can result from underestimates of dispersion. Furthermore, a standard error for each estimate is reported, which is derived from the posterior's curvature at its maximum (see Methods for details).

Source: Love et al., Genome Biology (2014) 15:550.
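Size factors account for differences in sequencing depth between samples. One common way to estimate them is the median-of-ratios method; the text above does not spell out the estimator, so the sketch below is an assumption about the general technique, with a hypothetical function name and toy data.

```python
import math
from statistics import median


def size_factors(counts: list[list[int]]) -> list[float]:
    """Median-of-ratios size factors.

    counts[i][j] is the read count for gene i in sample j. Genes with a
    zero count in any sample are skipped because their geometric mean is zero.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(k > 0 for k in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(k) for k in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


if __name__ == "__main__":
    # Toy data: sample 2 was sequenced about twice as deeply as sample 1.
    toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
    print(size_factors(toy))
```

Dividing each sample's counts by its size factor puts the samples on a common scale. Gene- and sample-specific normalization factors would replace this single per-sample number with a full matrix.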

Accepted Answer

The adjusted Rand index [37] was used to compare a hierarchical clustering based on various distances with the true cluster membership.
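For readers unfamiliar with the adjusted Rand index, the sketch below computes it from two label vectors using the usual pair-counting formula. The function name and toy labels are illustrative and are not part of the paper's code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true: list, labels_pred: list) -> float:
    """Adjusted Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs that are together in both partitions, in the first, and in the second.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = 0.5 * (together_true + together_pred)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in one cluster in both partitions
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same partition, labels renamed
    print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # partition that splits every true pair
```

The index is 1 for identical partitions regardless of how clusters are named, is close to 0 for random agreement, and can be negative when agreement is worse than chance.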

Accepted Answer

Inferential methods that treat each gene separately suffer here from lack of power, due to the high uncertainty of within-group variance estimates.

Accepted Answer

A disadvantage of the rlog transformation with respect to the VST is, however, that the ordering of genes within a sample will change if neighboring genes undergo shrinkage of different strength.

Accepted Answer

By default, outliers in conditions with six or fewer replicates cause the whole gene to be flagged and removed from subsequent analysis, including P value adjustment for multiple testing.

Accepted Answer

Due to the large number of tests performed in the analysis of RNA-seq and other genome-wide experiments, the multiple testing problem needs to be addressed.
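A widely used way to address the multiple testing problem is the Benjamini-Hochberg adjustment, which controls the false discovery rate. The answer above does not name the procedure, so the sketch below assumes it; the function name and P values are illustrative.

```python
def bh_adjust(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.5]
    for p, q in zip(raw, bh_adjust(raw)):
        print(f"p={p:.3f}  adjusted={q:.4f}")
```

Genes whose adjusted P value falls below a chosen cutoff (for example 0.1) form a list with an expected false discovery proportion at or below that cutoff.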

Accepted Answer

Its use cases are not limited to RNA-seq data or other transcriptomics assays; rather, many kinds of high-throughput count data can be used.

Accepted Answer

It was expected that the permutation-based SAMseq method would rarely produce adjusted P value < 0.1 in the evaluation set, because the three vs three comparison does not enable enough permutations.
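The limitation of a three vs three design can be made concrete by counting the distinct ways to assign group labels. The sketch below is a simplified single-gene illustration with hypothetical function names; SAMseq's actual procedure is not described in this text and may differ.

```python
from math import comb


def n_label_assignments(n_a: int, n_b: int) -> int:
    """Number of distinct ways to choose which samples are labeled as group A."""
    return comb(n_a + n_b, n_a)


def min_one_sided_permutation_pvalue(n_a: int, n_b: int) -> float:
    """Smallest P value an exact one-sided permutation test can return for one gene."""
    return 1.0 / n_label_assignments(n_a, n_b)


if __name__ == "__main__":
    for n in (3, 5, 8):
        print(n, n_label_assignments(n, n), min_one_sided_permutation_pvalue(n, n))
```

With three samples per group there are only 20 distinct label assignments, so the attainable P values are coarse; the count grows quickly with more replicates.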

Accepted Answer

In addition, the approach used in DESeq2 can be extended to isoform-specific analysis, either through generalized linear modeling at the exon level with a gene-specific mean, as in the DEXSeq package [30], or through counting evidence for alternative isoforms in splice graphs [31,32].

Accepted Answer

As the outlier is replaced with the value predicted by the null hypothesis of no differential expression, this is a more conservative choice than simply omitting the outlier.

Accepted Answer

Figure 4A demonstrates how such a thresholded test gives rise to a curved decision boundary: to reach significance, the estimated LFC has to exceed the specified threshold by an amount that depends on the available information.

Accepted Answer

The shrunken LFC estimates are more evenly spread around zero, and for very weakly expressed genes (with less than one read per sample on average), LFCs hardly deviate from zero, reflecting that accurate LFC estimates are not possible here.

Accepted Answer

The conventional null hypothesis is that the LFC is exactly zero. If any biological processes are genuinely affected by the difference in experimental treatment, this null hypothesis implies that the gene under consideration is perfectly decoupled from these processes.

Accepted Answer

Figure 5 provides diagnostic plots of the normalized counts under the ordinary logarithm with a pseudocount of 1 and the rlog transformation, showing that the rlog both stabilizes the variance through the range of the mean of counts and helps to find meaningful patterns in the data.

Accepted Answer

If estimates for average transcript length are available for the conditions, these can be incorporated into the DESeq2 framework as gene- and sample-specific normalization factors.
