Background
DESeq2 uses independent filtering to filter out low-read genes from the analysis. Nevertheless, it is customary to pre-filter ultra-low-read genes (usually rowSums(gene) > 10 reads is the basic condition). This reduces the load on DESeq2, and it also gives us fewer false positives, because genes with low reads have a much higher error rate for a given p-value when only one group has reads. I'm sure you are all familiar with this.
I wasn't sure whether to apply this to salmon, but I did, and it gives me better results. I am now filtering out genes with fewer than 6 reads on average from the salmon counts before putting them into DESeq2 for DGE.
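For reference, a minimal sketch of both filters in R; makeExampleDESeqDataSet() is used here only as a stand-in for a real DESeqDataSet built from the salmon/tximport counts, and the thresholds simply illustrate what is described above:

```r
library(DESeq2)

## Toy dataset just to make the sketch runnable; substitute your own
## DESeqDataSet built from the salmon/tximport counts.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

## Light pre-filter as described above: keep genes with more than ~10 reads in total.
dds <- dds[rowSums(counts(dds)) > 10, ]

## Stricter filter described above: keep genes with at least 6 reads on average.
dds <- dds[rowMeans(counts(dds)) >= 6, ]
```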
But when I checked the probability distribution of the counts, both the STAR and the salmon count tables show a large amount of noise for counts all the way up to about 20 reads on average.
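A density plot like the one described could be produced along these lines (hypothetical sketch; the count matrix here is simulated, not the actual data):

```r
## Toy stand-in for a genes x samples count matrix.
set.seed(1)
cts <- matrix(rnbinom(6000, mu = 50, size = 0.3), ncol = 6)

## Density of the mean count per gene on a log scale.
plot(density(log10(rowMeans(cts) + 1)),
     main = "Density of mean counts per gene",
     xlab = "log10(mean count + 1)")
```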
Question
What is causing the spikes in density at very low counts (below 20) for both salmon and STAR? See the attached images.
Note that both count tables were generated from the same raw FASTQ data.
My personal guess is that it's a sampling-frequency problem: since reads are somewhat quantized, the "bit size" becomes smaller as you approach 0 and you get oscillations, something like a Nyquist limit. I haven't tried to verify this, but I have found there is always a spike in the probability of finding 0, 1 or 2 reads per gene.
Although this is occasionally done, I would not say it is customary, at least for DESeq2.
It depends, I would say. Mike Love, the author of DESeq2, does not explicitly recommend it and says in the vignette that it is typically not necessary. In contrast, the edgeR maintainers explicitly recommend it, using filterByExpr() on the count matrix.

The post only mentioned DESeq2, so I should've clarified.
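For comparison, the edgeR route mentioned here looks roughly like this (sketch only; the count matrix and grouping below are placeholders, not real data):

```r
library(edgeR)

## Placeholders: a toy count matrix and a two-group design.
set.seed(1)
cts   <- matrix(rnbinom(6000, mu = 50, size = 0.3), ncol = 6)
group <- factor(rep(c("A", "B"), each = 3))

y    <- DGEList(counts = cts, group = group)
keep <- filterByExpr(y)                    # data-driven low-count filter
y    <- y[keep, , keep.lib.sizes = FALSE]  # recompute library sizes after filtering
```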
But yes, for different tools, the common workflows will vary.
Isn't that normal, since a large number of genes are not expressed or barely expressed, which inflates the density at 0/1/2? For the higher counts you basically have the span from non-low counts to infinity, while for non-expressed genes you have 0 to a small number like 10 or so. I do not find this surprising. Did you use tximport (unrelated to this question, just asking)?
That's what I was thinking the reason is. However, tximport counts are not strictly quantized, because salmon uses statistical inference to "estimate" the counts; yet the counts near 0 are mostly quantized to integers. I do not fully understand the implications of this, or why it is mostly quantized for some counts but not for others.
I did use tximport, yes.
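(For anyone reading along, the typical tximport -> DESeq2 path looks roughly like this; the directory layout, the `samples` table and the `tx2gene` mapping below are placeholders, not the actual setup used here.)

```r
library(tximport)
library(DESeq2)

## Placeholders:
##   samples: data.frame with columns `run` and `condition`
##   tx2gene: data.frame mapping transcript IDs to gene IDs
files <- file.path("salmon_quants", samples$run, "quant.sf")
names(files) <- samples$run

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
```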
You are seeing the spikes near 0 because you are on a log scale. STAR counts are integers, and you can see the curve starts to look smooth very quickly.
Also, a small adjustment with low counts will be more likely to give you the same number when rounded to the nearest integer. For example, 0.99 * 1 is still very close to 1, but 0.99 * 100 is 99 (a whole integer away).
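A small simulation illustrates the point (toy data, not the actual counts): integer counts near zero can only land on a few discrete positions on a log axis, so they show up as spikes, while larger counts blend into a smooth curve.

```r
## Simulated integer counts with many low values.
set.seed(1)
cts <- rnbinom(10000, mu = 20, size = 0.2)

## On a log scale, 0, 1, 2, 3, ... map to well-separated positions -> visible spikes near 0;
## higher counts are packed closely together -> smooth density.
plot(density(log10(cts + 1)),
     main = "Integer counts on a log scale",
     xlab = "log10(count + 1)")
```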