Question

optimal size of gene sets for fgsea analysis

0

5.0 years ago

ayuhutamisyarif5 • 0

Hi,

Probably someone has already raised this question: how large should a gene set be when it is used as input for fgsea or another GSEA analysis?

I use the GO biological process collection from MSigDB, and it seems to contain a large number of gene sets (7529, I believe).

When I ran the fgsea analysis with minSize 15, maxSize 500, and nperm 1000, I did not find any significant pathways (adjusted p-value < 0.05). However, once I increased the number of permutations to 10000, I saw 277 significant pathways.

I am wondering whether increasing the number of permutations is a good way to compensate for a larger collection such as GO biological process, or whether I should instead set more stringent minSize and maxSize parameters.

Kindly provide any information about this.

Thanks in advance.

Best,

Ayu

fgsea pathway • 4.7k views

ADD COMMENT • link updated 5.0 years ago by alserg ▴ 1000 • written 5.0 years ago by ayuhutamisyarif5 • 0

score 2 · Answer 1 · 2020-08-28

2

5.0 years ago

alserg ▴ 1000

Please update fgsea to version >= 1.13.2. The underlying algorithm was changed in that version and no longer requires setting nperm at all, so you won't run into this problem.
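
To illustrate why nperm 1000 could give no hits while nperm 10000 gives many, here is a rough sketch. It assumes the older permutation mode reports empirical p-values of the form (b + 1) / (nperm + 1), so the smallest reachable p-value is about 1 / (nperm + 1), and that the adjustment is Benjamini-Hochberg. The numbers (5000 tested gene sets, 50 truly enriched ones) are hypothetical and only meant to show the effect of the p-value floor.

```python
def min_pvalue(nperm):
    """Smallest empirical p-value a permutation test can report, assuming
    p = (b + 1) / (nperm + 1) with b = 0 permutations beating the observed score."""
    return 1.0 / (nperm + 1)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(nperm, n_sets, n_true, alpha=0.05):
    """Hypothetical scenario: n_true gene sets sit at the p-value floor,
    all remaining sets have p = 1."""
    pvals = [min_pvalue(nperm)] * n_true + [1.0] * (n_sets - n_true)
    return sum(p < alpha for p in bh_adjust(pvals))


print(count_significant(1000, 5000, 50))   # floor too high to survive adjustment
print(count_significant(10000, 5000, 50))  # lower floor, hits survive adjustment
```

In this toy setup the same 50 gene sets go from none significant to all significant purely because the p-value floor drops, which is consistent with what was observed in the question. A method that does not depend on nperm avoids that artifact.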

ADD COMMENT • link 5.0 years ago by alserg ▴ 1000

1

Is there actually an argument against using sets larger than 500 genes with fgsea? Or is this a relic from the original GSEA implementation? I can think of situations where one compares completely different cell types and therefore has a lot of DEGs that all shape the identity of one cell type or the other.

ADD REPLY • link 5.0 years ago by ATpoint 89k

2

There are a couple of related reasons why limiting the gene set size can be a good idea:

Larger gene sets still take more time to process.
Large gene sets are usually harder to interpret because they are too general.
The GSEA null hypothesis is that the set is distributed uniformly at random along the gene ranking. For larger structured gene sets this tends not to hold, even when there is no biological significance. In particular, artifacts from gene-gene expression correlation start to play an important role.

So overall, for the GO collection, setting the max size to 500 seems to be a good idea. But for some collections, such as transcription factor targets or ChIP-seq-derived gene sets, it can be advantageous not to set any limit at all.
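
To make the size limits concrete, here is a minimal sketch of what a minSize/maxSize filter does. It assumes a set's size is counted only over genes that are present in the ranked list. The function name, gene names, and set names are made up for illustration and are not part of fgsea.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the ranked gene list has
    between min_size and max_size genes (inclusive)."""
    universe = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = set(genes) & universe
        if min_size <= len(overlap) <= max_size:
            kept[name] = sorted(overlap)
    return kept


# Toy data: 1000 ranked genes and four hypothetical gene sets.
ranked = [f"g{i}" for i in range(1000)]
sets = {
    "tiny_set": [f"g{i}" for i in range(5)],
    "medium_set": [f"g{i}" for i in range(100)],
    "broad_set": [f"g{i}" for i in range(800)],
    "mostly_unmeasured": [f"g{i}" for i in range(990, 1030)],
}

kept = filter_gene_sets(sets, ranked)
print(sorted(kept))
```

With the default limits only the medium-sized set is kept. Raising max_size to the length of the ranking keeps the broad set as well, which corresponds to the "no limit" case mentioned above.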

ADD REPLY • link 5.0 years ago by alserg ▴ 1000