Question

fgsea: What does fgseaMultilevel argument sampleSize mean/when to change it?

3.9 years ago

kelen ▴ 210

Hi!

fgsea has recently moved to using fgseaMultilevel by default which, from what I understand, does not use simple permutation sampling for p-value calculations but instead an "adaptive multilevel splitting Monte Carlo approach" (from the documentation). I have been trying to understand some of the new arguments included when running fgsea, particularly sampleSize:

(sampleSize - The size of a random set of genes which in turn has size = pathwaySize)

fgseaMultilevel(
  pathways,
  stats,
  sampleSize = 101,
  minSize = 1,
  maxSize = Inf,
  eps = 1e-10,
  scoreType = c("std", "pos", "neg"),
  nproc = 0,
  gseaParam = 1,
  BPPARAM = NULL,
  nPermSimple = 1000,
  absEps = NULL
)

I have been wrestling with the pre-print (https://www.biorxiv.org/content/10.1101/060012v2.full) and all I can gather (this might not be accurate) is that sampleSize:

- has to be an odd number
- can't be smaller than 3 (that is from their code https://github.com/ctlab/fgsea/blob/master/R/fgseaMultilevel.R)
- defines the number of gene sets used to calculate probabilities?

But the concept of this sampleSize number still remains quite abstract to me (why is it 101 by default?), yet it looks very important and is easy to change (not very foolproof if someone like me plays around with the parameters). And the definition says "...size = pathwaySize", which confuses me.

Basically I am hoping for any input on what it does in simple terms and when I should change it. The results vary quite a bit if I try, say, 11 vs 101 vs 1001 (in terms of the number of significantly enriched pathways). If it's something similar to sampling random gene sets for p-value calculation (which I think is the classical approach), then I am inclined to think this IS something that should potentially be tailored to suit specific needs...

Please help, thanks.

Yours truly,

Desperate For Help

fgsea gsea functional enrichment analysis RNA-Seq • 5.2k views


score 6 · Accepted Answer · 2020-12-18

This is a parameter that controls estimation accuracy, somewhat similar to nperm in fgseaSimple. The higher the value, the more accurate the results and the slower (proportionally) it runs.

To understand how it relates to accuracy, you can use the multilevelError(pval, sampleSize) function, which tells you the expected estimation error for a given true p-value and sampleSize:

> multilevelError(1e-15, 101)
[1] 1.017545

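The result is on a log2 scale. Reading it as roughly one standard deviation, a 95% interval spans about two of them, i.e. about 2 units of log2 either way, which is a factor of 2^2 = 4. So for the above example, if the true p-value is 1e-15, fgseaMultilevel with the default sampleSize=101 will give you an estimate between 0.25*1e-15 and 4*1e-15 with a 95% probability.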
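As a rough check of that arithmetic, the sketch below converts the log2-scale error into a fold range in base R. It hard-codes the value printed above rather than calling fgsea, and it assumes the reported error behaves like one standard deviation, so that about 95% of estimates fall within two of them. The variable names are illustrative only.

```r
# Values quoted in the answer above
p_true   <- 1e-15
log2_err <- 1.017545  # printed by multilevelError(1e-15, 101)

# Assumption: log2_err is about one standard deviation of log2(estimate),
# so an approximate 95% interval is +/- 2 standard deviations on the log2 scale.
half_width <- 2 * log2_err

# Convert back from the log2 scale to a multiplicative (fold) factor
fold <- 2^half_width                           # roughly 4.1
ci   <- p_true * 2^c(-half_width, half_width)  # lower and upper bound

print(fold)
print(ci)
```

The fold factor comes out slightly above 4, which is where the "0.25x to 4x" rule of thumb comes from.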

When should you change it? Likely never. Unlike nperm in fgseaSimple, sampleSize does not limit how small a p-value can be estimated (that limit is set by the eps argument). The default value of 101 has proved to be a good compromise between speed and accuracy. For the example above, I don't know of a practical case where you would need to distinguish 1e-15 from 2e-15 accurately, while 1e-15 will still be distinguishable from 1e-16 and 1e-14. However, if you scale sampleSize down to 25 you'll get roughly a 4-fold speed improvement, but the log2 error for 1e-15 will be about 2, which means that your CI will run from 1/16*1e-15 to 16*1e-15.
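To see the trade-off for yourself, a minimal sketch along these lines tabulates the error for a few odd sampleSize values. It assumes the fgsea package is installed and uses only the multilevelError(pval, sampleSize) call shown above. The "relative_cost" column is simply sampleSize divided by the default, based on the statement that runtime scales proportionally, and the "approx_fold_95" column again assumes a two-standard-deviation interval.

```r
library(fgsea)  # assumption: fgsea is installed and provides multilevelError()

p_true       <- 1e-15
sample_sizes <- c(25, 101, 1001)  # odd values, as sampleSize is expected to be odd

# Expected log2-scale error for each sampleSize at the given true p-value
log2_err <- sapply(sample_sizes, function(s) multilevelError(p_true, s))

tab <- data.frame(
  sampleSize     = sample_sizes,
  log2_error     = log2_err,
  approx_fold_95 = 2^(2 * log2_err),   # assumed +/- 2 SD interval, as a fold factor
  relative_cost  = sample_sizes / 101  # assumed runtime relative to the default
)
print(tab)
```

Larger sampleSize values should shrink the fold factor at a proportional cost in runtime, which is why the default is rarely worth changing.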