Hello Biostars,
I am quite confused about why some enrichment tools ask for a background data set to perform GSEA or GO analysis while others do not. For example, GOrilla and DAVID ask for a background data set, whereas Panther and Enrichr do not. If Panther and Enrichr do not ask for one, then they must be using all of the genes in their (species-specific) search space to compute the resulting enrichments, p-values and other associated scores.
If Panther and Enrichr can provide good GSEA without a background, why use a background at all?
I guess what I am asking is what is the statistical and biological difference between using a background list of genes during GSEA, and not using a background list?
I would appreciate any guidance on this issue. If anyone could also point me to good literature on the matter, that would be much appreciated.
You may be confusing the "classical" gene enrichment analyses such as GO over-representation (which are based on hypergeometric tests and the like, and do benefit from a universe/background specific to your experiment) with "classical" GSEA, which is a quite different approach: there, your input already is the whole universe/background of genes, accompanied by an expression or ranking value for each gene.
Hi Papyrus, thank you for the answer. I think this is likely the case. Could you please elaborate on it? Are there any reviews or book chapters that discuss this?
Yes, of course. This review may be a good starting point: "Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges" (Khatri, Sirota and Butte, PLoS Computational Biology, 2012).
It all depends on what you have. If you only have your list of significant genes, you can go to GOrilla, for example, and input that list along with the background list of analyzed genes (say you have DEGs from RNA-seq: your universe/background would then be the filtered list of genes that you actually tested to find the DEGs). R/Bioconductor tools such as goseq are also useful for this purpose if you want to try something other than web tools.
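To make the effect of the background concrete, here is a minimal sketch of the kind of hypergeometric over-representation test these tools are built on. It is not the code of GOrilla, DAVID or any other tool; the function name and all of the gene counts are hypothetical and chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, term_in_universe, list_size, term_in_list):
    """Upper-tail hypergeometric p-value: P(X >= term_in_list).

    universe_size    -- number of genes in the background
    term_in_universe -- background genes annotated to the term
    list_size        -- number of genes in the significant list
    term_in_list     -- significant genes annotated to the term
    """
    upper = min(term_in_universe, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(term_in_universe, k)
        * comb(universe_size - term_in_universe, list_size - k)
        for k in range(term_in_list, upper + 1)
    )
    return tail / total

# Hypothetical example: 200 DEGs, 20 of them annotated to some GO term.
# Background A: the whole genome (20000 genes, 400 annotated to the term).
p_genome = hypergeom_enrichment_p(20000, 400, 200, 20)
# Background B: only the 10000 genes that were actually tested
# (380 of them annotated to the term, i.e. most term genes are expressed).
p_tested = hypergeom_enrichment_p(10000, 380, 200, 20)

print(f"whole-genome background: p = {p_genome:.3g}")
print(f"tested-genes background: p = {p_tested:.3g}")
```

With these made-up numbers the same gene list looks more strongly enriched against the whole genome than against the tested genes, because the expected overlap by chance is 4 genes in the first case and 7.6 in the second. That is the usual argument for supplying an experiment-specific background when you have one.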
On the other hand, the "more advanced" enrichment analyses such as GSEA try to use the information on all of the genes (be it expression or something else) to find enriched functions/pathways. Thus your only input is usually your universe/background accompanied by the expression values. There are also R/Bioconductor tools for these analyses; limma, for instance, provides functionality for analyses similar to GSEA, or with clearer hypotheses.
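As a rough illustration of why no separate background is needed here, below is a simplified, unweighted running-sum score in the spirit of GSEA. It is only a sketch: the real GSEA method weights hits by the ranking metric and assesses significance by permutation, neither of which is done here. The function name and gene names are hypothetical, and the ranked list is assumed to contain each gene once.

```python
def running_sum_es(ranked_genes, gene_set):
    """Unweighted (Kolmogorov-Smirnov-like) running-sum enrichment score.

    ranked_genes -- all tested genes, ordered from most up- to most down-regulated
    gene_set     -- genes belonging to the pathway/term of interest
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0  # running-sum value with the largest absolute deviation from zero
    for gene in ranked_genes:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked universe of ten genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]

es_top = running_sum_es(ranked, {"g1", "g2", "g3"})      # set at the top of the ranking
es_bottom = running_sum_es(ranked, {"g8", "g9", "g10"})  # set at the bottom of the ranking
print(es_top, es_bottom)
```

The ranked list is itself the universe: every gene that is not in the set pushes the running sum down, so the "background" is built into the input rather than supplied separately.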
Thank you, this is very useful information.