I think several things are being conflated here.
Firstly, note that FPKM and TPM are not counts.
The log1p transform - i.e. log2(x+1) - is often used when visualising counts (not TPM or FPKM - it may be used when calculating CPM, but there the count is added _before_ being normalised to library size), where it avoids the problem of log2(0) being undefined. It is also sometimes used when manually calculating log2 fold changes for the same reason; it is not generally part of the log2 fold changes that come out of DE tools.

1 is often a good pseudocount to use with counts because it is the smallest whole count you can add - for a measure that must stay an integer (as counts do), you can't add a smaller amount. It is also generally at least one order of magnitude (if not more) smaller than the average count, and so makes little difference to the position of the average point. It also has the helpful property that the smallest possible value on the count scale (0) maps to 0 on the transformed scale. Both of these properties are helpful for plotting bar charts, scatterplots and heatmaps. There are cases where 1 isn't the correct pseudocount, such as when the expected untransformed values are low (and TPMs and FPKMs may be such a situation). DESeq2's rlog function extends the log1p concept by choosing pseudocounts in a principled manner, rather than just using 1.
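As a quick sketch (not from the original answer) of why the +1 matters, here is the transform applied to a few counts in Python:

```python
import numpy as np

counts = np.array([0, 1, 5, 100, 10000])

# log2 of raw counts fails at 0 (numpy returns -inf with a warning)
with np.errstate(divide="ignore"):
    raw_log = np.log2(counts)

# log2(x + 1): defined everywhere, and maps a count of 0 to exactly 0
log1p_counts = np.log2(counts + 1)

print(raw_log)       # first entry is -inf
print(log1p_counts)  # first entry is 0; large counts are barely changed
```

Note how for the large counts the two transforms are almost identical, so the pseudocount only matters at the low end of the scale.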
Making it possible to plot 0 on a log graph is a special case of something called regularization/shrinkage - we shrink the size of effects where we have less evidence for them. Adding 1 to our counts makes a count of 0 closer to a count of 1 on the log scale: without the pseudocount, log2(0) is -Inf, which is a very long way from log2(1) = 0. With the pseudocount, they are much closer together: log2(0+1) = 0, log2(1+1) = 1. As the numbers get higher, however, the effect gets smaller - the effect of adding 1 to 100 on the log scale is far smaller than the effect of adding 1 to 10. The effect of adding 1 to a set of numbers is therefore to reduce the variance between small numbers while leaving the variance of large numbers unaffected. Since, when we are dealing with counts, the coefficient of variation is larger for smaller numbers, this is known as a variance stabilising transform.
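To make the variance-stabilising point concrete, here is an illustrative Python snippet (my own, not from the original answer) comparing the log-scale spread of a small and a large group of counts with the same fold differences:

```python
import numpy as np

small = np.array([1, 2, 3])
large = np.array([100, 200, 300])  # same fold differences as `small`

def log_range(x, pseudo=0):
    """Spread (max - min) of log2(x + pseudo) within a group."""
    y = np.log2(x + pseudo)
    return y.max() - y.min()

# Without a pseudocount, both groups have the same log-scale spread:
print(log_range(small))      # log2(3) ~ 1.585
print(log_range(large))      # log2(3) ~ 1.585

# With +1, the small counts are squeezed together,
# while the large counts are barely affected:
print(log_range(small, 1))   # log2(4) - log2(2) = 1.0
print(log_range(large, 1))   # log2(301/101) ~ 1.575
```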
This becomes particularly important when we talk about log2 fold changes. Consider having 2 counts in treatment and 1 count in control. This is a log2 fold change of 1 (log2(2) - log2(1) = 1 - 0 = 1). However, imagine we had sequenced just one fewer read in control (LFC = log2(2) - log2(0) = Inf) or one more read in control (LFC = log2(2) - log2(2) = 0). That is a massive difference in LFC for a change of just 1 in the counts. Now consider having 64 counts in treatment and 32 counts in control. The LFC is the same. But now consider the effect of changing the number of reads in control by 1 either way - there is very little difference. This is why MA plots for RNA-seq experiments look like triangles, with lots of variation at low expression levels and little at high expression levels.
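The arithmetic above can be checked with a few lines of Python (an illustrative sketch, not part of the original answer):

```python
import math

def lfc(treat, control):
    """Naive log2 fold change; +inf when the control count is 0."""
    if control == 0:
        return math.inf
    return math.log2(treat / control)

# Low counts: shifting control by one read swings the LFC wildly
print(lfc(2, 1))    # 1.0
print(lfc(2, 0))    # inf
print(lfc(2, 2))    # 0.0

# High counts, same true ratio: a one-read shift barely matters
print(lfc(64, 32))  # 1.0
print(lfc(64, 31))  # ~1.046
print(lfc(64, 33))  # ~0.956
```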
When we add a pseudocount we "shrink the LFCs towards 0", and the amount of shrinkage we apply is inversely related to the strength of evidence we have for the LFC. The change in LFC for 2 counts vs 1 count is large because we didn't have much evidence for the change in the first place. The change in LFC for 64 vs 32 is much smaller because we had much better evidence for a two-fold difference. This means your list of genes with high fold changes won't be dominated by lowly expressed genes.
Effectively we are doing a Bayesian analysis, where we have a prior that the log fold change is 0 and the strength of the prior is encoded in the size of the pseudocount. For principled ways of choosing the size of the pseudocount, see "empirical Bayes". For a good explanation of this, see http://varianceexplained.org/r/empirical_bayes_baseball/, which uses the beta-binomial model as an example; it is easier to think about than the gamma-Poisson (i.e. negative binomial) model used in RNA-seq analysis, but the concepts are similar.
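To illustrate the pseudocount-as-prior idea in the beta-binomial setting that the linked post uses, here is a minimal sketch (the prior parameters `a` and `b` are hypothetical values chosen for illustration, not taken from the post):

```python
# With a Beta(a, b) prior and x successes out of n trials, the posterior
# mean of the proportion is (x + a) / (n + a + b): the prior behaves
# exactly like a pair of pseudocounts added to the observations.
def posterior_mean(x, n, a, b):
    return (x + a) / (n + a + b)

a, b = 10, 30  # hypothetical prior: mean 0.25, worth 40 "pseudo-observations"

# Little data: the estimate is pulled strongly towards the prior mean 0.25
print(posterior_mean(2, 4, a, b))      # 12/44  ~ 0.273, not the raw 0.5

# Lots of data: the estimate stays close to the raw proportion 0.5
print(posterior_mean(200, 400, a, b))  # 210/440 ~ 0.477
```

The same shrinkage-proportional-to-evidence behaviour carries over to the gamma-Poisson model, just with less convenient algebra.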
TLDR: the fact that log fold changes get smaller when you add a pseudocount, and that this happens more for smaller counts, is a feature, not a bug.