Hi!
Fgsea has recently moved to using fgseaMultilevel by default and, from what I understand, does not use sampling for p-value calculations, but instead an "adaptive multilevel splitting Monte Carlo approach" (from the documentation). I have been trying to understand some of the new arguments included when running fgsea, particularly sampleSize, which the documentation defines as "The size of a random set of genes which in turn has size = pathwaySize".
fgseaMultilevel(
pathways,
stats,
sampleSize = 101,
minSize = 1,
maxSize = Inf,
eps = 1e-10,
scoreType = c("std", "pos", "neg"),
nproc = 0,
gseaParam = 1,
BPPARAM = NULL,
nPermSimple = 1000,
absEps = NULL
)
I have been wrestling with the pre-print (https://www.biorxiv.org/content/10.1101/060012v2.full) and all I can gather (which might not be accurate) is that sampleSize
- has to be an odd number
- it can't be smaller than 3 (that is from their code https://github.com/ctlab/fgsea/blob/master/R/fgseaMultilevel.R)
- defines the number of gene sets used to calculate probabilities?
But the concept of this sampleSize number still remains quite abstract to me (why is it 101 by default?), yet it looks very important and is easily changeable (not very foolproof if someone like me plays around with the parameters). And the definition states "...size = pathwaySize", which confuses me.
Basically, I am hoping for any input on what it does in simple terms and when I should change it. The results do vary quite a bit if I try, say, 11 vs 101 vs 1001 (in terms of the number of significantly enriched pathways). If it's something similar to sampling random gene sets for p-value calculation (which I think is the classical approach), then I am inclined to think this IS something that should potentially be tailored to suit specific needs...
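For concreteness, here is a minimal sketch of the kind of comparison I mean. It is not my exact script: the helper names (count_significant, compare_sample_sizes), the 0.05 cutoff on padj and the fixed seed are illustrative assumptions, and the commented usage lines rely on the example data that ships with fgsea rather than my own data.

```r
# Count pathways passing an adjusted p-value cutoff in one fgsea result table.
# Assumes `res` has a `padj` column, as fgsea results do.
count_significant <- function(res, alpha = 0.05) {
  sum(res$padj < alpha, na.rm = TRUE)
}

# Run fgseaMultilevel once per sampleSize and tabulate the significant counts.
# `sizes`, `alpha` and `seed` are illustrative choices, not recommendations.
compare_sample_sizes <- function(pathways, stats,
                                 sizes = c(11, 101, 1001),
                                 alpha = 0.05, seed = 42) {
  n_sig <- vapply(sizes, function(s) {
    set.seed(seed)  # reset the RNG so each run starts from the same state
    res <- fgsea::fgseaMultilevel(pathways, stats, sampleSize = s)
    count_significant(res, alpha)
  }, integer(1))
  data.frame(sampleSize = sizes, n_significant = n_sig)
}

# Usage with the example data bundled with fgsea:
# data(examplePathways, package = "fgsea")
# data(exampleRanks, package = "fgsea")
# compare_sample_sizes(examplePathways, exampleRanks)
```

All other arguments are left at the defaults shown in the signature above, so sampleSize is the only setting that changes between runs.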
Please help, thanks.
Yours truly,
Desperate For Help
Thank you so much!
This is perfect, really helpful. A good take-home message for myself is then not to play around with every parameter. But in case I want to do so, now I know what effect this has.