Hi,
I'm trying to run a GO enrichment analysis in R. I'm using the gage package, with the GO terms downloaded from Ensembl via the biomaRt package. My problem is that I'm getting too many enriched categories, and they're pretty redundant. This is after applying an FDR cutoff of 0.05 and testing only GO categories with 10-50 genes, to avoid categories that are too esoteric or too general.
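For reference, this is roughly how I build the gene sets and run gage (a minimal sketch; the fold-change vector fc, the BP-only filter and the dataset name are placeholders for my actual pipeline):

    library(biomaRt)
    library(gage)

    ## map Ensembl gene IDs to GO terms (biological process only, as an example)
    mart   <- useEnsembl("ensembl", dataset = "hsapiens_gene_ensembl")
    go_map <- getBM(attributes = c("ensembl_gene_id", "go_id", "namespace_1003"),
                    mart = mart)
    go_map <- go_map[go_map$go_id != "" &
                     go_map$namespace_1003 == "biological_process", ]

    ## gage expects a named list of gene-ID vectors, one element per GO term
    go_sets <- split(go_map$ensembl_gene_id, go_map$go_id)

    ## fc: named vector of log2 fold changes (names = Ensembl gene IDs)
    res <- gage(fc, gsets = go_sets, set.size = c(10, 50), same.dir = TRUE)

    ## keep the categories passing the FDR cutoff
    up     <- res$greater
    sig_up <- up[!is.na(up[, "q.val"]) & up[, "q.val"] < 0.05, ]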
I came across two solutions to this issue:
1. Cluster the GO terms using pairwise distances between them, which can be obtained with packages such as GOSim, using the function getTermSim. However, when I get a few hundred enriched terms that I'd like to cluster to remove redundancy, getTermSim takes a very long time, so it's impractical (see the sketch after this list).

2. Use GO slim terms. For that I use the GSEABase package, download GO slim files from geneontology.org, and use them to trim the GO terms downloaded with biomaRt. The problem here is that, at least for human data (which is what I'm analyzing), the GO slim terms seem a bit poor to me.
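For option 1, this is the kind of thing I'm running (a rough sketch; enriched_go stands for my vector of significant GO IDs, and the similarity method and cut height are arbitrary choices):

    library(GOSim)

    setOntology("BP")   # GOSim works within one ontology at a time
    sim <- getTermSim(enriched_go, method = "relevance", verbose = FALSE)

    ## turn similarities into distances and cluster the terms
    d  <- as.dist(1 - sim)
    hc <- hclust(d, method = "average")
    cl <- cutree(hc, h = 0.7)   # then keep one representative term per cluster

It works on small term lists, but with a few hundred terms the getTermSim step is where everything stalls.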
So my question is: is there a solution to this, some happy medium?
Is there a precomputed file of all pairwise GO term distances that can be downloaded? That would save calling getTermSim each time I run the script.
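The only workaround I have so far is caching the matrix on disk so getTermSim runs only once per term set (the file name is arbitrary), but a shared precomputed file would obviously be nicer:

    cache_file <- "go_term_sim.rds"
    if (file.exists(cache_file)) {
      sim <- readRDS(cache_file)
    } else {
      sim <- getTermSim(enriched_go, method = "relevance", verbose = FALSE)
      saveRDS(sim, cache_file)
    }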
I usually find that topGO does a good job of getting rid of the excessive redundancy among GO terms. It also often reports medium-sized categories as the most significant ones.
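A minimal sketch of how I usually call it (the gene universe, ID type and cutoffs are placeholders); the weight01/elim algorithms down-weight parent terms whose signal is already explained by their children, which is what removes most of the redundancy:

    library(topGO)
    library(org.Hs.eg.db)

    ## geneList: named numeric vector of adjusted p-values (names = Entrez IDs)
    GOdata <- new("topGOdata",
                  ontology = "BP",
                  allGenes = geneList,
                  geneSel  = function(p) p < 0.05,
                  annot    = annFUN.org,
                  mapping  = "org.Hs.eg.db",
                  ID       = "entrez",
                  nodeSize = 10)   # drops very small categories

    resW <- runTest(GOdata, algorithm = "weight01", statistic = "fisher")
    GenTable(GOdata, weight01 = resW, orderBy = "weight01", topNodes = 20)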