Background
DESeq2 uses independent filtering to filter out low-read genes from the analysis. Nevertheless, it is customary to pre-filter ultra-low-read genes (usually rowSums(gene) > 10 reads is the basic condition). This reduces the load on DESeq2, and it also gives us fewer false positives, because genes with low reads have a much higher error rate for a given p-value when only one group has reads. I'm sure you are all familiar with this.
I wasn't sure whether to apply this to salmon, but I did, and it gives me better results. I am now filtering out genes with fewer than 6 reads on average from the salmon counts before putting them into DESeq2 for DGE.
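For reference, a minimal sketch of both filters in R; makeExampleDESeqDataSet() is used here only as a stand-in for a real DESeqDataSet built from the salmon/tximport counts, and the thresholds simply illustrate what is described above:

```r
library(DESeq2)

## Toy dataset just to make the sketch runnable; substitute your own
## DESeqDataSet built from the salmon/tximport counts.
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)

## Light pre-filter as described above: keep genes with more than ~10 reads in total.
dds <- dds[rowSums(counts(dds)) > 10, ]

## Stricter filter described above: keep genes with at least 6 reads on average.
dds <- dds[rowMeans(counts(dds)) >= 6, ]
```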
But when I checked the probability distribution of the counts, both the STAR and the salmon count tables show a large amount of noise for counts all the way up to about 20 reads on average.
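A density plot like the one described could be produced along these lines (hypothetical sketch; the count matrix here is simulated, not the actual data):

```r
## Toy stand-in for a genes x samples count matrix.
set.seed(1)
cts <- matrix(rnbinom(6000, mu = 50, size = 0.3), ncol = 6)

## Density of the mean count per gene on a log scale.
plot(density(log10(rowMeans(cts) + 1)),
     main = "Density of mean counts per gene",
     xlab = "log10(mean count + 1)")
```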
Question
What is causing the spikes in density at very low counts (below 20) for both salmon and STAR? See the attached images.
Note that both count tables were generated from the same raw FASTQ data.
My personal guess is that it's a sampling-frequency problem: since reads are somewhat quantized, the "bit size" becomes smaller as you approach 0 and you get oscillations, something like a Nyquist limit. I haven't tried to verify this, but I have found there is always a spike in the probability of finding 0, 1 or 2 reads per gene.
Although this is occasionally done, I would not say it is customary, at least for DESeq2.
It depends, I would say. Mike Love, the author of DESeq2, does not explicitly recommend it and says in the vignette that it is typically not necessary. In contrast, the edgeR maintainers explicitly recommend it, using filterByExpr() on the count matrix.

The post only mentioned DESeq2, so I should've clarified.
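For comparison, the edgeR route mentioned here looks roughly like this (sketch only; the count matrix and grouping below are placeholders, not real data):

```r
library(edgeR)

## Placeholders: a toy count matrix and a two-group design.
set.seed(1)
cts   <- matrix(rnbinom(6000, mu = 50, size = 0.3), ncol = 6)
group <- factor(rep(c("A", "B"), each = 3))

y    <- DGEList(counts = cts, group = group)
keep <- filterByExpr(y)                    # data-driven low-count filter
y    <- y[keep, , keep.lib.sizes = FALSE]  # recompute library sizes after filtering
```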
But yes, for different tools, the common workflows will vary.
Isn't that normal, since a large number of genes are not expressed or barely expressed, which inflates the density at 0/1/2? For the higher counts you basically have the span from non-low counts to infinity, while for non-expressed genes you have 0 to a small number like 10 or so. I do not find this surprising. Did you use tximport (unrelated to this question, just asking)?
That's what I was thinking the reason is. However, tximport counts are not strictly quantized, because salmon uses statistical inference to "estimate" the counts; yet the counts near 0 are mostly quantized to integers. I do not fully understand the implications of this, or why it is mostly quantized for some counts but not for others.
I did use tximport, yes.
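(For anyone reading along, the typical tximport -> DESeq2 path looks roughly like this; the directory layout, the `samples` table and the `tx2gene` mapping below are placeholders, not the actual setup used here.)

```r
library(tximport)
library(DESeq2)

## Placeholders:
##   samples: data.frame with columns `run` and `condition`
##   tx2gene: data.frame mapping transcript IDs to gene IDs
files <- file.path("salmon_quants", samples$run, "quant.sf")
names(files) <- samples$run

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
```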
You are seeing the spikes near 0 because you are on a log scale. STAR counts are integers, and you can see the curve starts to look smooth very quickly.
Also, a small adjustment with low counts will be more likely to give you the same number when rounded to the nearest integer. For example, 0.99 * 1 is still very close to 1, but 0.99 * 100 is 99 (a whole integer away).
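A small simulation illustrates the point (toy data, not the actual counts): integer counts near zero can only land on a few discrete positions on a log axis, so they show up as spikes, while larger counts blend into a smooth curve.

```r
## Simulated integer counts with many low values.
set.seed(1)
cts <- rnbinom(10000, mu = 20, size = 0.2)

## On a log scale, 0, 1, 2, 3, ... map to well-separated positions -> visible spikes near 0;
## higher counts are packed closely together -> smooth density.
plot(density(log10(cts + 1)),
     main = "Integer counts on a log scale",
     xlab = "log10(count + 1)")
```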