High discrepancy between Primary estimate and Bootstraps from Salmon
2.2 years ago

Hi everyone,

I am currently trying to do transcript-level differential expression analysis using the output from Salmon and the swish method from the fishpond library. I have observed something that is potentially messing up my data analysis. For some transcripts, I get 0 counts in the primary estimates from the quant.sf file. However, after running swish, those same transcripts are identified as highly differentially expressed, and plotting the count values using plotInfReps shows count values far larger than 0.

Is it correct to think this might be due to very high mapping uncertainty in Salmon, which then leads to 0 counts in the primary estimates but very high values in the inferential replicates (infReps) from the bootstraps?

If yes, would it be best practice to filter out those kinds of transcripts from the analysis, and do you have any recommendation on how to do it?

Any input is appreciated.

salmon swish uncertainty dishpond • 1.2k views

I did some further analysis and this may actually be more of a purely Salmon-related question. I will take the example of a specific transcript from my data.

The quant.sf file from Salmon contains the following entry:

Name               Length  EffectiveLength  TPM       NumReads
ENST00000525340.5  2496    2316.000         0.000000  0.000

However, after running the ConvertBootstrapsToTSV.py script on my bootstraps and extracting the count values for this same transcript from the TSV file, I get the following counts:

Bootstrap   ENST00000525340.5
1   7539.44707482287
2   7732.65780460576
3   8469.60118056819
4   8087.06880760676
5   7413.52524661232
6   7377.02918954835
7   7438.47695879207
8   7215.07853878101
9   7853.29787241109
10  7615.02734898884
11  7848.62401555959
12  7802.97547868047
13  7754.53423687043
14  7955.98627933994
15  7383.45404577626
16  7646.01377726913
17  7868.73284501966
18  7778.15644776701
19  8375.07752442173
20  8269.71586862337

So as you can see, there is some variation between the different bootstraps, but every bootstrap value is far from 0, unlike the primary estimate in the quant.sf file.

I ran Salmon 1.4.0 with the following command:
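For anyone wanting to screen for this pattern, here is a minimal Python sketch that summarises the 20 bootstrap values above and flags a transcript whose point estimate lies far outside the bootstrap spread. The function names and the 3-standard-deviation threshold are assumptions made for illustration, not something provided by Salmon or fishpond.

```python
import statistics

def bootstrap_summary(values):
    """Return mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    cv = sd / mean if mean else float("inf")
    return mean, sd, cv

def is_discrepant(point_estimate, values, n_sd=3.0):
    """Flag a point estimate lying more than n_sd bootstrap SDs from the bootstrap mean.

    n_sd is an arbitrary, illustrative threshold (an assumption of this sketch).
    """
    mean, sd, _ = bootstrap_summary(values)
    if sd == 0:
        return point_estimate != mean
    return abs(point_estimate - mean) > n_sd * sd

# Bootstrap counts for ENST00000525340.5, copied from the table in this post
bootstraps = [
    7539.44707482287, 7732.65780460576, 8469.60118056819, 8087.06880760676,
    7413.52524661232, 7377.02918954835, 7438.47695879207, 7215.07853878101,
    7853.29787241109, 7615.02734898884, 7848.62401555959, 7802.97547868047,
    7754.53423687043, 7955.98627933994, 7383.45404577626, 7646.01377726913,
    7868.73284501966, 7778.15644776701, 8375.07752442173, 8269.71586862337,
]
point_estimate = 0.0  # NumReads from quant.sf

mean, sd, cv = bootstrap_summary(bootstraps)
print(f"bootstrap mean={mean:.1f} sd={sd:.1f} cv={cv:.3f}")
print("discrepant:", is_discrepant(point_estimate, bootstraps))
```

Applied across all transcripts, a check like this would give a list of candidates to inspect before deciding whether to filter them.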

 salmon quant \
 -p 8 \
 -i /salmon__idx/ \
 -l A \
 -1 /R1_001.fastq.gz \
 -2 /R2_001.fastq.gz \
 --numBootstraps 20 \
 --seqBias \
 --gcBias \
 --dumpEq \
 -o /output_folder/

Any idea why this is happening?

2.2 years ago
Rob 6.9k

Hi Lucas,

That is, indeed, a very big difference between the point and bootstrap estimates. Could you try modifying your salmon command to include the --useEM flag and see if the point estimate changes at all? That is, use:

 salmon quant \
 -p 8 \
 -i /salmon__idx/ \
 -l A \
 -1 /R1_001.fastq.gz \
 -2 /R2_001.fastq.gz \
 --numBootstraps 20 \
 --seqBias \
 --gcBias \
 --dumpEq \
 --useEM \
 -o /output_folder/

This will use the "standard" EM rather than the variational Bayesian EM (VBEM). The VBEM tends to produce sparser solutions, at least with the default parameters.


Hi Rob,

Thanks for taking the time to look at my issue. I re-ran Salmon with the --useEM flag and checked a few genes for which I was observing major discrepancies. It fixed the issue: I now have a good match between the bootstraps and the point estimate, both for zero values and for the non-zero expression values that were very different with the VBEM algorithm!

I took some time to read the section on the --useEM flag in the Salmon documentation to understand how it affects the results. It is somewhat beyond my reach for now, so I guess I'll have to go read the paper on VBEM. Anyway, the bootstraps seem to be converging nicely at 20 iterations. I may be trading off some accuracy, but it does feel better to have zeroes in both files now and no more such large discrepancies.
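One way to see which point estimates moved between the default run and the --useEM run is to compare the NumReads column of the two quant.sf files. The sketch below is only illustrative: the transcript names tx_A and tx_B and their values are made up, the min_diff threshold is an assumption, and only the column layout is taken from the quant.sf header shown earlier in the thread.

```python
import csv
import io

def read_num_reads(handle):
    """Parse a tab-separated quant.sf stream into {transcript name: NumReads}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}

def changed_transcripts(before, after, min_diff=1.0):
    """Return names present in both runs whose NumReads differ by at least min_diff."""
    shared = before.keys() & after.keys()
    return sorted(n for n in shared if abs(before[n] - after[n]) >= min_diff)

HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"

# Hypothetical toy contents standing in for the two quant.sf files
vbem_text = HEADER + "tx_A\t1500\t1320.000\t12.500000\t250.000\n" \
                   + "tx_B\t2496\t2316.000\t0.000000\t0.000\n"
em_text = HEADER + "tx_A\t1500\t1320.000\t12.400000\t250.200\n" \
                 + "tx_B\t2496\t2316.000\t3.100000\t120.000\n"

vbem = read_num_reads(io.StringIO(vbem_text))
em = read_num_reads(io.StringIO(em_text))
print(changed_transcripts(vbem, em))
```

With real data, the io.StringIO objects would be replaced by file handles opened on the quant.sf file in each run's output folder.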

Lucas


Hi Lucas,

I'm glad that this was able to address your issue. There are several technical differences between the standard EM and the VBEM, but intuitively the biggest one is that they optimize slightly different objectives. In particular, the VBEM has an explicitly tunable sparsity prior, which previous work has shown can sometimes be beneficial. However, the default priors promote sparsity, which can lead to more zeros in the solutions. The reason this may cause a discrepancy between the point estimate and the bootstraps is perhaps less obvious, but I suspect it is related to this.

Rob
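To give some intuition for how a sparsity-promoting prior can push small abundances toward zero, here is a generic sketch of the textbook variational Bayes update for a Dirichlet prior, compared with the plain EM update. It is not Salmon's implementation: the prior value of 0.01 is an illustrative assumption, the helper names are invented, and the normalising constant shared by all transcripts is omitted. In EM, a transcript's weight in the next iteration is proportional to its expected count. In the VB update it is proportional to exp(digamma(count + prior)), which is much smaller than the count when the count is small.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 using the recurrence and an asymptotic series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    # Shift x upward with psi(x) = psi(x + 1) - 1/x until the series is accurate
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def em_weight(expected_count):
    """Unnormalised EM weight: proportional to the expected count."""
    return expected_count

def vb_weight(expected_count, prior):
    """Unnormalised VB weight for a Dirichlet prior (shared normaliser omitted)."""
    return math.exp(digamma(expected_count + prior))

PRIOR = 0.01  # illustrative value, an assumption of this sketch
for count in (0.1, 1.0, 10.0, 1000.0):
    print(f"count={count:8.1f}  EM={em_weight(count):10.4f}  VB={vb_weight(count, PRIOR):10.4f}")
```

For large counts the two weights are nearly identical, while for counts well below 1 the VB weight collapses toward zero, so a transcript with little support loses it quickly over successive iterations.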
