3.3 years ago
garfield320
I'm using the fgsea package in R to run GSEA. According to its documentation, the "size" column in the result refers to the "size of the pathway after removing genes not present in 'names(stats)'".
So if my pathway contained 100 genes, and the "size" in my result is 10, does that mean that 90 genes in my experiment overlapped with my pathway? If this understanding is correct, is there an easy way to pull out the size of the pathway itself, rather than the size after removing genes that are not present?
It means that only 10 of the pathway's genes had gene-level statistic values in the input. The other 90 genes are not considered, as no rank can be assigned to them.
1) If these are true numbers, this looks very suspicious. Usually most of the pathway genes should be ranked. 2) In the context of GSEA, it's incorrect to report the full gene set size, as we can't say anything about the genes that were not present. 3) If you really do want this, you can just calculate the sizes yourself; it shouldn't be hard.
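As a rough illustration of point 3, here is a minimal base R sketch that computes both the full pathway size and the size after dropping unranked genes. The `pathways` list and `stats` vector are hypothetical toy inputs standing in for whatever is passed to fgsea; `size_table` is just an illustrative name.

```r
# Hypothetical toy inputs: a named list of gene sets and a named statistics vector.
pathways <- list(
  pathwayA = c("G1", "G2", "G3", "G4", "G5"),
  pathwayB = c("G2", "G6", "G7")
)
stats <- c(G1 = 2.1, G2 = -0.5, G6 = 1.3, G9 = 0.2)

# Full size: number of distinct genes annotated to each pathway.
full_size <- lengths(lapply(pathways, unique))

# Ranked size: genes that also appear in names(stats), i.e. what "size" describes.
ranked_size <- vapply(
  pathways,
  function(genes) sum(unique(genes) %in% names(stats)),
  integer(1)
)

size_table <- data.frame(
  pathway     = names(pathways),
  full_size   = full_size,
  ranked_size = ranked_size,
  coverage    = ranked_size / full_size
)
print(size_table)
```

Assuming the fgsea result is stored in a table with a `pathway` column, a table like this could be joined to it on that column, so the full size is shown next to the reported "size" rather than replacing it.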
alserg These are true numbers. I was checking some pathways manually, and I seem to have many pathways that were identified as significant and have a "size" that is <10% of the total number of genes in the pathway. Is there a general threshold for how many pathway genes should be ranked? Any suggestions on what I should inspect to figure out what's going on with my size parameter?
What is the length of your input stats vector? It should be at least ~10K.
alserg I only have 3K. I guess I should have been clear in my initial question, but I'm using pre-ranked GSEA (using log of p value multiplied by the sign of fold change) to analyze proteomic data acquired from the mass spec, and getting 10K proteins on the mass spec would be impossible.
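For reference, a ranking statistic of the kind described above might be built as in the sketch below. The data frame `de` and its column names (`protein`, `log2FC`, `pvalue`) are hypothetical, and using -log10 of the p-value (so that stronger evidence gives a larger magnitude) is an assumption about the exact transform.

```r
# Hypothetical differential abundance results; column names are illustrative only.
de <- data.frame(
  protein = c("P1", "P2", "P3"),
  log2FC  = c(1.2, -0.8, 0.3),
  pvalue  = c(0.001, 0.02, 0.5)
)

# Signed score: direction from the fold change, magnitude from -log10(p).
stats <- setNames(sign(de$log2FC) * -log10(de$pvalue), de$protein)

# Sort from most up-regulated to most down-regulated.
stats <- sort(stats, decreasing = TRUE)

# The length of this vector is the number of ranked proteins asked about above.
print(length(stats))
print(stats)
```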
Got it. Then having 10% could be OK. Still, reporting the initial pathway size can be misleading, as these other genes are not considered at all.
I'm also noticing that some of my pathways have much larger sizes (let's say 90%), but the p values for those pathways are really high so I had filtered them initially. Is it possible that the length of my input is somehow affecting the statistical process?