Digital expression measurements (e.g. RNA-seq) are often used to determine the change of quantities upon some treatment or stimulus. The resulting value of interest is the fold change (often logarithmized).
This effect size of the change is often treated as a value that can be computed as \(lfc(A,B)=\log_2 \frac{A}{B}\). However, due to the probabilistic nature of the experiments, the effect size rather is a random variable that must be estimated. This fact becomes obvious when considering that \(A\) or \(B\) can be 0, even if the true abundance is non-zero.
We have shown that this can be modelled in a Bayesian framework [1,2]. The intuitively computed effect size is the maximum likelihood estimator of a binomial model, where the effect size is not represented as fold change, but as proportion (of note, the log fold change simply is the logit transformed proportion). The Bayesian prior corresponds to pseudocounts frequently used to prevent infinite fold changes by \(A\) or \(B\) being zero. Furthermore, the Bayesian framework offers more advanced estimators (e.g. interval estimators or the posterior mean, which is the optimal estimator in terms of squared errors).
This R package offers the implementation to harness the power of this framework.
The most basic function to estimate effect sizes is
PsiLFC(A,B)
\(A\) and \(B\) are vectors of counts, corresponding to the n genes in conditions \(A\) and \(B\) (i.e. they are columns from the normal count matrices). What PsiLFC does it to obtain reasonable pseudocounts (see next section), compute the posterior mean for each entry, and then output normalized (i.e. median-centered) effect sizes in the log\(_2\) fold change representation.
Let’s consider a real world example:
library(DESeq2)
library(lfc)
data(airway, package="airway")
assay(airway)[,2]
A <- assay(airway)[,1]
B <-
PsiLFC(A,B)
ll <-plot(ecdf(ll),xlim=c(-1,1),xlab="Log2 fold change treated/untreated",ylab="Cumulative frequency",main="Cell N61311",col='blue')
lines(ecdf(CenterMedian(log2((A+1)/(B+1)))),col='red')
This shows the log fold change distributions estimated by the posterior mean (blue) and the intuitive way using pseudocounts of 1 (red). This distribution is heavily distorted by several genes with no reads in any of the two conditions. Interestingly, the intuitive way of computing effect sizes results in an asymmetric distribution with more downregulated than upregulated genes.
plot(ecdf(ll[A>0 | B>0]),xlim=c(-1,1),xlab="Log2 fold change treated/untreated",ylab="Cumulative frequency",main="Cell N61311",col='blue')
lines(ecdf(CenterMedian(log2((A+1)/(B+1))[A>0 | B>0])),col='red')
Here, only genes are considered that have at least one read in one of the two conditions. The intuitive fold changes still appear to overestimate changes, as well as show more artifacts.
It is also possible to directly estimate effect sizes on SummarizedExperiment objects:
head(PsiLFC.se(airway,contrast=c("dex","untrt","trt")))
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
## 0.33634752 0.00000000 -0.21035628 -0.04804218 0.10586343
## ENSG00000000938
## 0.89037506
Also, this package provides a drop-in replacement for DESeq2’s results
function:
DESeqDataSetFromMatrix(countData = assay(airway),colData = colData(airway),design= ~ dex)
dds <- DESeq(dds)
dds <- results(dds, contrast=c("dex","untrt","trt"),cre=TRUE)
res <-head(res)
## DataFrame with 6 rows and 9 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 708.602170 0.3788470 0.173141 2.188082 0.0286636
## ENSG00000000005 0.000000 NA NA NA NA
## ENSG00000000419 520.297901 -0.2037604 0.100599 -2.025478 0.0428183
## ENSG00000000457 237.163037 -0.0340428 0.126279 -0.269584 0.7874802
## ENSG00000000460 57.932633 0.1171786 0.301237 0.388992 0.6972820
## ENSG00000000938 0.318098 1.7245505 3.493633 0.493627 0.6215698
## padj PsiLFC Credible 0.05 Credible 0.95
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 0.139383 0.3363475 0.273959 0.3988301
## ENSG00000000005 NA 0.0000000 -1.817845 1.8398897
## ENSG00000000419 0.183458 -0.2103563 -0.282880 -0.1378499
## ENSG00000000457 0.930540 -0.0480422 -0.155755 0.0597279
## ENSG00000000460 0.895428 0.1058634 -0.112729 0.3250639
## ENSG00000000938 NA 0.8903751 -0.665848 2.5667235
Important: To make this work, load lfc
after DESeq2
!
Note that here we also computed the 90% credible interval. The parameter cre
can also be given to PsiLFC
or PsiLFC.se
!