Question

How is the p-value calculated in GSEA?

4.3 years ago

Eliran Turgeman ▴ 10

I have two gene sets from GSEA, for example, and I am calculating the intersection between them. I am trying to calculate the p-value for that intersection, i.e. the probability that it occurs by chance.
I wrote a Python script that calculates that p-value, but I wanted to double-check that it's indeed the value I'm supposed to get.
On this website you can do the exact same thing and get the p-value; for example, here's a link to a comparison between the gene set MYMOD_07 and others.
As you can see, one of the columns is the p-value, and I was wondering how it was actually calculated.
Is it two-sided or one-sided? If it's one-sided, then from which side? Is it the logarithmic value?

geneset p value • 6.3k views

updated 4.3 years ago by i.sudbery 20k • written 4.3 years ago by Eliran Turgeman ▴ 10

https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/FAQ#Where_are_the_GSEA_statistics_.28ES.2C_NES.2C_FDR.2C_FWER.2C_nominal_p_value.29_described.3F

4.3 years ago by GenoMax 147k

I came across this page before, but I'm quite a beginner in bioinformatics and especially in statistics, so it's quite hard for me to find my answer there among all the lingo I'm not familiar with.
Could you please explain the answer explicitly?

4.3 years ago by Eliran Turgeman ▴ 10

score 8 · Answer 1 · 2020-08-19

GSEA works by ranking all genes used in an analysis by some statistic. So this might be all genes in the genome (traditionally all genes with a probe on an array).

Start with an enrichment score of 0. Then take the gene ranked No. 1 and ask if it is in your gene-set. If it is, increase the enrichment score; if not, decrease it. Then move on to the gene ranked No. 2 and do the same thing, and so on down the whole list of genes. When you're done, find the highest enrichment score you reached at any point along the list. This is your Enrichment Score for that gene-set.
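As a rough illustration of that walk down the ranked list, here is a minimal Python sketch. The function name `enrichment_score` and the step sizes are my own assumptions: I step up by 1/(number of genes in the set) and down by 1/(number of genes not in the set), so the running score returns to 0 at the end of the list. The real GSEA software weights the upward steps by the ranking statistic and reports the largest deviation from zero in either direction, so treat this as a simplified version and not what GSEA itself runs.

```python
def enrichment_score(ranked_genes, gene_set):
    """Simplified, unweighted running-sum enrichment score (illustrative only)."""
    gene_set = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")

    # Assumed step sizes: chosen so the running sum ends at zero.
    up = 1.0 / n_hit
    down = 1.0 / n_miss

    running = 0.0
    best = 0.0  # highest value the running sum ever reaches
    for gene in ranked_genes:
        running += up if gene in gene_set else -down
        if running > best:
            best = running
    return best


# Toy ranked list: the gene set sits at the very top of the ranking.
ranked = ["A", "B", "C", "D", "E", "F"]
print(enrichment_score(ranked, {"A", "B"}))
```

A gene-set concentrated near the top of the ranking pushes the running sum up early, which is what gives it a high score.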

Next randomly shuffle the gene ranks (or sample labels) to put them in a random order, and repeat the scoring process above. Do this 1000 times.

Two things are now calculated. The P-value is calculated by looking at how often the Enrichment Score from a random permutation is at least as big as the one from the actual ranking. The Normalised Enrichment Score is calculated by dividing the Enrichment Score from the actual ranking by the mean of the Enrichment Scores from the random permutations.
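The permutation step can be sketched in Python along these lines. Everything here is illustrative: `enrichment_score` is the same simplified, unweighted score described above (repeated so the snippet runs on its own), `permutation_test` is a hypothetical helper name, and I shuffle the gene order where GSEA can also shuffle sample labels. The p-value is one-sided: it is the fraction of shuffles whose score is at least as large as the observed one.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Simplified, unweighted running-sum score (illustrative, not GSEA's exact statistic)."""
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    up, down = 1.0 / n_hit, 1.0 / (len(ranked_genes) - n_hit)
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += up if gene in gene_set else -down
        best = max(best, running)
    return best


def permutation_test(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, one-sided p-value, normalised ES) from gene-order shuffles."""
    gene_set = set(gene_set)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    observed = enrichment_score(ranked_genes, gene_set)

    shuffled = list(ranked_genes)
    null_scores = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # put the genes in a random order
        null_scores.append(enrichment_score(shuffled, gene_set))

    # P-value: how often a random ordering scores at least as high as the real one.
    p_value = sum(1 for s in null_scores if s >= observed) / n_perm
    # NES: observed score divided by the mean score of the random orderings.
    mean_null = sum(null_scores) / n_perm
    nes = observed / mean_null if mean_null > 0 else float("nan")
    return observed, p_value, nes


genes = [f"g{i}" for i in range(100)]
es, p, nes = permutation_test(genes, genes[:10])  # set occupies the top 10 ranks
print(es, p, nes)
```

With 1000 shuffles the smallest non-zero p-value this can report is 0.001, so a result of 0 really means "less than about 1 in 1000".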

The FDR is calculated by comparing the distribution of Normalised Enrichment Scores from many different gene-sets.