Question

GO enrichment analysis using R

4

8.0 years ago

rubic ▴ 270

Hi,

I'm trying to run a GO enrichment analysis in R. I'm using the gage package, and the GO terms are downloaded from Ensembl using the biomaRt package. My problem is that I'm getting too many enriched categories and they're pretty redundant. This is after applying an FDR-adjusted p-value cutoff of 0.05 and only testing GO categories with 10-50 genes, in order to avoid categories that are either too specific or too general.

I came across two solutions to this issue:

Cluster the GO terms using pairwise distances between them, which can be obtained with packages such as GOSim, using the function getTermSim. However, if I get a few hundred enriched terms that I'd like to cluster in order to remove redundancy, getTermSim takes far too long to be practical.
Use GO-slim terms. For that I use the GSEABase package, download the GO-slim files from geneontology.org, and use them to trim the GO terms downloaded using biomaRt. The problem here is that, at least for human data (which is what I'm analyzing), the GO-slim terms seem rather limited to me.

So my question is whether there's a solution to this, some happy medium?

Is there a precomputed file of all pairwise GO term distances that can be downloaded? That would save calling getTermSim each time I run the script.
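Short of a precomputed download, one way to avoid recomputing similarities on every run is to compute the matrix once, cache it to disk, and cluster from the cached copy. Below is a minimal base R sketch of that idea. It assumes a small invented similarity matrix standing in for real getTermSim output; the term names, similarity values, adjusted p-values, cache path, and the 0.5 cut height are all placeholders.

```r
# Toy inputs (invented for illustration): pairwise similarities in [0, 1]
# and adjusted p-values for five enriched terms.
terms <- c("termA", "termB", "termC", "termD", "termE")
sim <- matrix(c(
  1.00, 0.90, 0.80, 0.10, 0.20,
  0.90, 1.00, 0.85, 0.15, 0.10,
  0.80, 0.85, 1.00, 0.20, 0.10,
  0.10, 0.15, 0.20, 1.00, 0.90,
  0.20, 0.10, 0.10, 0.90, 1.00
), nrow = 5, byrow = TRUE, dimnames = list(terms, terms))
padj <- c(termA = 0.001, termB = 0.02, termC = 0.01, termD = 0.03, termE = 0.004)

# Cache the matrix so later runs can skip the slow similarity step.
cache_file <- file.path(tempdir(), "go_term_similarity.rds")  # placeholder path
saveRDS(sim, cache_file)
sim_cached <- readRDS(cache_file)

# Cluster on distance = 1 - similarity and cut the tree at a chosen height.
hc <- hclust(as.dist(1 - sim_cached), method = "average")
clusters <- cutree(hc, h = 0.5)

# Keep the most significant term from each cluster as its representative.
representatives <- vapply(
  split(names(padj), clusters[names(padj)]),
  function(ids) ids[which.min(padj[ids])],
  character(1)
)
print(representatives)
```

In a real script the cached matrix would only need to be rebuilt when the set of enriched terms or the annotation version changes.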

GO GO-slim enrichment-analysis R • 31k views

updated 8.0 years ago by Guangchuang Yu ★ 2.6k • written 8.0 years ago by rubic ▴ 270

3

I usually find that topGO does a good job of getting rid of the excessive redundancy among GO terms. It also often reports medium-sized categories as the most significant ones.

8.0 years ago by Martombo ★ 3.1k

score 4 · Answer 1 · 2016-11-15

8.0 years ago

Guangchuang Yu ★ 2.6k

Maybe you can try clusterProfiler, which can do GO enrichment analysis using either a hypergeometric test or GSEA.

It can also simplify the result by removing highly similar terms, with the similarity calculated by GOSemSim.
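As a rough sketch of how the two steps above might look with clusterProfiler's `enrichGO` and `simplify`: the wrapper name `run_go_enrichment`, the use of Ensembl gene IDs with org.Hs.eg.db, and the 0.7 similarity cutoff are assumptions for illustration, and the argument names follow recent clusterProfiler releases, so older versions may differ. The 10-50 gene set size range and the 0.05 cutoff are taken from the question.

```r
# Hypothetical wrapper; requires the Bioconductor packages clusterProfiler
# and org.Hs.eg.db to be installed before it is called.
run_go_enrichment <- function(genes, universe, cutoff = 0.7) {
  # Hypergeometric (over-representation) test on Biological Process terms.
  ego <- clusterProfiler::enrichGO(
    gene          = genes,      # character vector of genes of interest
    universe      = universe,   # character vector of all tested genes
    OrgDb         = org.Hs.eg.db::org.Hs.eg.db,
    keyType       = "ENSEMBL",  # assumes Ensembl gene IDs
    ont           = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    minGSSize     = 10,
    maxGSSize     = 50
  )
  # Among terms whose semantic similarity exceeds `cutoff`, keep only the
  # one with the smallest adjusted p-value.
  clusterProfiler::simplify(ego, cutoff = cutoff, by = "p.adjust", select_fun = min)
}
```

A call would look like `run_go_enrichment(my_genes, all_genes)`, where both arguments are your own vectors of Ensembl IDs. Lowering `cutoff` removes more terms.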

0

But like GOSim, clusterProfiler generates a pairwise semantic distance matrix, which takes a very long time.

8.0 years ago by rubic ▴ 270

0

It should produce output in a reasonable time.

8.0 years ago by Guangchuang Yu ★ 2.6k

score 1 · Answer 2 · 2016-11-15

8.0 years ago

Carlo Yague 8.9k

My problem is that I'm getting too many enriched categories and they're pretty redundant.

A third solution could be to filter the enriched GO categories based on:

p-value (be more stringent)
number of genes in the category (very big groups are often not very informative - yes, I'm talking to you, "cellular process")
minimum number of enriched genes per category (sometimes a category is found significant with just one enriched gene, especially if the category is very small)
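The three filters above can be sketched in base R as follows. The results table is invented for illustration, and the column names (`padj`, `set_size`, `n_hits`) and the 0.01 and 3-gene thresholds are assumptions; only the 10-50 size range comes from the question.

```r
# Toy enrichment results (invented values).
res <- data.frame(
  term     = c("cellular process", "term_B", "term_C", "term_D"),
  padj     = c(1e-8, 0.004, 0.03, 0.0005),  # adjusted p-value
  set_size = c(12000, 35, 20, 12),          # genes annotated to the category
  n_hits   = c(900, 8, 5, 1),               # enriched genes in the category
  stringsAsFactors = FALSE
)

keep <- res$padj < 0.01 &                          # stricter significance cutoff
  res$set_size >= 10 & res$set_size <= 50 &        # drop very small or very large sets
  res$n_hits >= 3                                  # require several supporting genes

filtered <- res[keep, ]
print(filtered)
```

Here only `term_B` survives: "cellular process" is too large, `term_C` misses the stricter p-value cutoff, and `term_D` rests on a single gene.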

2

Thanks for the response. I'm actually already applying these filters - just updated that in my post.

8.0 years ago by rubic ▴ 270