Question

How does salmon use Polyester?

0

19 months ago

shinyjj ▴ 50

Hi Salmon users or developers! In the salmon paper, to establish the ground truth for evaluation, RSEM is run on certain data and Polyester is then applied to the output of RSEM.

Why use Polyester on the output of RSEM?
In the salmon GitHub repo, can somebody direct me to the script that uses Polyester, with all its different parameters? Polyester is an R package, but I don't see any R in the salmon GitHub repo. I want to see how salmon uses Polyester.

salmon • 979 views

updated 19 months ago by Michael Love ★ 2.6k • written 19 months ago by shinyjj ▴ 50

score 1 · Answer 1 · 2023-05-06

1

19 months ago

Rob 6.9k

Polyester is a read simulator, not a quantification or analysis tool. In the salmon paper, what was done was to take experimental (i.e. real) RNA-seq samples, and to quantify them with RSEM. Then, the abundances learned by RSEM were used to seed Polyester to produce simulated data. Those simulated data then have known ground truth. From this, it is possible to assess the quantification accuracy on simulated data using different quantification methods.
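To make that workflow concrete, here is a rough sketch in Python with toy numbers. It is only an illustration, not the scripts used for the paper (the real simulation was done in R with Polyester). The function names, transcript names, and all values are hypothetical, and MARD (mean absolute relative difference) is just one accuracy metric that could be used for the comparison.

```python
def tpm_to_expected_counts(tpm, eff_len, n_fragments):
    """Turn TPM estimates into expected fragment counts per transcript.

    Fragments are sampled in proportion to TPM * effective length.
    """
    weights = {t: tpm[t] * eff_len[t] for t in tpm}
    total = sum(weights.values())
    return {t: n_fragments * w / total for t, w in weights.items()}


def mean_abs_relative_difference(truth, estimate):
    """MARD: mean of |x - y| / (x + y), taken as 0 when both values are 0."""
    diffs = []
    for t in truth:
        x, y = truth[t], estimate.get(t, 0.0)
        diffs.append(0.0 if x + y == 0 else abs(x - y) / (x + y))
    return sum(diffs) / len(diffs)


# Toy abundances standing in for RSEM output on a real sample (made-up values).
rsem_tpm = {"tx1": 500000.0, "tx2": 300000.0, "tx3": 200000.0}
eff_len = {"tx1": 1000.0, "tx2": 2000.0, "tx3": 500.0}

# These counts play the role of the known ground truth that seeds the simulator.
truth = tpm_to_expected_counts(rsem_tpm, eff_len, n_fragments=1_000_000)

# Pretend a quantifier run on the simulated reads returned these counts.
estimate = {"tx1": 400000.0, "tx2": 520000.0, "tx3": 80000.0}
mard = mean_abs_relative_difference(truth, estimate)
print(f"MARD = {mard:.4f}")
```

The point is only that the simulator's input abundances are known exactly, so any quantifier's output can be scored against them.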

If you're interested in simulating data using Polyester, the Polyester package itself has some nice documentation. You can also find, in this repo made by Mike Love, the scripts for simulating data with Polyester that were used in this paper.

However, to be clear, in a standard run of salmon (e.g. when you are quantifying transcript abundance in new samples), Polyester is not used at all and is not part of the typical salmon-based quantification pipeline.

19 months ago by Rob 6.9k

0

Thanks Rob! One question: I don't see hexamer bias in the code or in the Polyester documentation, or maybe I missed something? When you were using Polyester, do you recall adding hexamer bias? If not, how crucial is hexamer bias in RNA-seq data? To me, hexamer bias is affected by GC bias, so if you already have GC bias set up in Polyester, it's OK.

19 months ago by shinyjj ▴ 50

0

In the alpine and Salmon papers, we only simulated the fragment GC bias. We isolated the fragment GC bias from the hexamer bias using alpine, as mentioned here:

Note that the GC content curves used in the simulation (Fig. 2e) were estimated after removing the read start bias using the Cufflinks VLMM, and so the fragment sequence effect observed is not a proxy for read start bias.

https://www.nature.com/articles/nbt.3682#Sec2

I don't think Polyester has functionality for hexamer / read start bias, but in our analyses in the alpine paper, this was not a contributing factor to across-sample variation in the studies we looked at, including IVT-seq, GEUVADIS, ABRF, and SEQC. The driving factor behind mis-estimation of isoform abundance (e.g. the cases of mis-identification of the dominant isoform) was fragment-level bias, most likely arising from PCR amplification steps that vary across library preparation batches.
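For intuition about what simulating fragment GC bias means, here is a generic sketch in Python: each candidate fragment is kept with a probability that depends on its GC content. This is not Polyester's or alpine's code. The curve in `keep_probability`, its `optimum` and `width` parameters, and the example sequences are all made up for illustration, whereas the curves used in the papers were estimated from real data.

```python
import random


def gc_fraction(seq):
    """Fraction of G/C bases in a fragment sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def keep_probability(gc, optimum=0.5, width=0.2):
    """Hypothetical curve: 1.0 at the optimum GC, falling linearly to 0."""
    return max(0.0, 1.0 - abs(gc - optimum) / width)


def apply_fragment_gc_bias(fragments, rng):
    """Keep each fragment with the probability given by its GC content."""
    return [f for f in fragments
            if rng.random() < keep_probability(gc_fraction(f))]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
fragments = ["ACGTACGTAC", "GGGGCCCCGG", "ATATATATAT", "ACGGTCATGC"]
kept = apply_fragment_gc_bias(fragments, rng)
print(kept)
```

With this toy curve, fragments at 50% GC are always kept and fragments at 0% or 100% GC are always dropped, so GC-extreme regions of a transcript end up under-covered. Varying the curve per sample mimics batch-specific bias.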

19 months ago by Michael Love ★ 2.6k