Question

Hypergeometric GO analysis? and network diagram of generated GO terms

1

8.4 years ago

mforde84 ★ 1.4k

Hi,

I have a sequencing panel of approx. 1100 genes for 100 or so patients. We found that approx. 150 of these genes exhibit mutational frequencies greater than 5% in our population, and we want to get an idea of the enriched GO terms for this subset of genes.

I tried to do this using tools that perform hypergeometric testing for enrichment, and I didn't get any significant results. P-values were significant before, but not after, multiple-testing correction, and I think there are a number of type 1 errors due to the nature of the nonparametric testing employed.
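For reference, the hypergeometric enrichment test mentioned above can be sketched in base R. Everything below is made-up toy data (gene names, GO labels, and the term-to-gene mapping are hypothetical), with the sequencing panel used as the background rather than the whole genome. It is a minimal illustration, not the procedure any particular tool implements.

```r
# Toy data (hypothetical): 1100 panel genes, 150 "hit" genes, three GO terms.
set.seed(1)
panel <- paste0("gene", 1:1100)
hits  <- sample(panel, 150)
go2gene <- list(
  "GO:A" = c(hits[1:30], setdiff(panel, hits)[1:20]),  # enriched by construction
  "GO:B" = sample(panel, 80),
  "GO:C" = sample(panel, 10)
)

# Upper-tail hypergeometric p-value: P(X >= k) when drawing length(hits)
# genes without replacement from the panel.
hyper_p <- function(term_genes, hits, panel) {
  term_genes <- intersect(term_genes, panel)
  k <- sum(hits %in% term_genes)   # hits annotated to the term
  m <- length(term_genes)          # annotated genes in the panel
  n <- length(panel) - m           # non-annotated genes in the panel
  phyper(k - 1, m, n, length(hits), lower.tail = FALSE)
}

p_raw <- vapply(go2gene, hyper_p, numeric(1), hits = hits, panel = panel)
p_bh  <- p.adjust(p_raw, method = "BH")  # Benjamini-Hochberg correction
print(cbind(p_raw, p_bh))
```

Using the panel as the background matters here: with a targeted panel, testing against the whole genome would mostly reflect how the panel was designed.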

So I read up and implemented a parametric method whereby I randomly subsample 150 genes from the total panel list, determine the ratio of each GO term compared to all terms, bootstrap this process 1000 times, calculate the probability of the observed ratio given the distribution of the randomly bootstrapped ratios, then correct for multiple testing using BH. I've checked the resulting ratios and they exhibit a normal distribution, mostly. The only ones which don't are very rare GO terms (e.g., 1 associated gene), and these can easily be accounted for either by -log normalization or by removal from subsequent analyses.
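One possible reading of this procedure, sketched in base R on made-up toy data (all gene names, GO labels, and helper names such as `null_ratios` and `enrich` are hypothetical). It assumes the "ratio" is the fraction of the sampled genes annotated to a term and that the p-value comes from a normal approximation to the resampled ratios; an empirical p-value is computed alongside for comparison.

```r
# Toy data (hypothetical), standing in for the real panel and annotations.
set.seed(42)
panel <- paste0("gene", 1:1100)
hits  <- sample(panel, 150)
go2gene <- list(
  "GO:A" = c(hits[1:30], setdiff(panel, hits)[1:20]),  # enriched by construction
  "GO:B" = sample(panel, 80),
  "GO:C" = sample(panel, 40)
)

# Null distribution of the hit ratio for one term: draw random gene sets of
# the same size as the observed list and record the fraction in the term.
null_ratios <- function(term_genes, panel, size, nboot = 1000) {
  replicate(nboot, mean(sample(panel, size) %in% term_genes))
}

enrich <- function(term_genes, hits, panel, nboot = 1000) {
  obs  <- mean(hits %in% term_genes)
  null <- null_ratios(term_genes, panel, length(hits), nboot)
  c(observed = obs,
    # Parametric p-value: normal approximation to the null ratios.
    p_normal = pnorm(obs, mean(null), sd(null), lower.tail = FALSE),
    # Empirical p-value with the usual +1 correction, for comparison.
    p_empirical = (sum(null >= obs) + 1) / (nboot + 1))
}

res <- t(vapply(go2gene, enrich, numeric(3), hits = hits, panel = panel))
res <- cbind(res, p_normal_bh = p.adjust(res[, "p_normal"], method = "BH"))
print(res)
```

With 1000 resamples the empirical p-value cannot go below roughly 1/1001, whereas the normal approximation can return much smaller values; that difference alone can change what survives BH correction.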

Results look pretty good, and show enrichment for terms we are expecting, so I think it's a legitimate method. I've also tried increasing the number of bootstraps, and it doesn't have a significant impact on the mean or standard deviation for the ratio distributions, so I'm not just generating swaths of data to artificially create a significant enrichment. I've seen this method implemented elsewhere, so there must be some sort of validity to it. I was even thinking because it's a parametric method, it's likely more powerful than the hypergeometric method? I'm curious if anyone has any comments, concerns, or words of caution about the approach.

Also, given this list of significantly enriched GO terms, we'd like to generate a network diagram to help visualize the result. However, for some reason, my boss is dead set on using a specific piece of software (i.e., ClueGO) even though it won't work for this particular application. So in turn, I have to find a way to hack a solution. I've tried REVIGO, which actually worked very well, however the interconnectedness between nodes was a bit scarce, as it's based upon semantic similarity determined by SimRel. Ideally, I'd like to generate something similar to: https://www.researchgate.net/profile/Annika_Raupach/publication/265094954/figure/fig5/AS:288698752745472@1445842554015/Figure-5-ClueGOcytoscape-analysis-of-AKT-regulated-phosphoproteins-AKT2-speci-fi-c-A.png . I've been able to generate some decent networks with nodes and edges, so I can get the groundwork going, but I'm curious if there is a resource I'm missing that will specifically take GO terms as input and meaningfully generate edges between terms clustered together into specific functional groupings.

Thanks, Martin

gene ontology network diagram visualization • 4.2k views

ADD COMMENT • link updated 6.1 years ago by Biostar 20 • written 8.4 years ago by mforde84 ★ 1.4k

1

The one in the figure was done manually with Cytoscape plugins, I reckon. You can use Cytoscape and play around to do the same. When you say nodes and edges, you need to assign weights to your graph based on the adjacency matrix generated from your nodes. Only then will you be able to derive that kind of graph. I believe that if these 150 genes now have some kind of enrichment, with an enrichment value and p-value, you can always use those in Cytoscape to build your desired network. Or you can use REVIGO and adapt the R script it generates online to your benefit. However, it uses semantic similarity.

ADD REPLY • link 8.4 years ago by ivivek_ngs ★ 5.2k

0

Yea, that seems like the most reasonable option. REVIGO has been instrumental in this regard, however I had been having some issues generating more applicable edges, which has been addressed in another comment / answer. Thanks for your input, it's much appreciated.

ADD REPLY • link 8.4 years ago by mforde84 ★ 1.4k

1

Make sure you control for gene size when you run your subsampling.
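One way this could be done, as a rough sketch: weight the random draws by gene length so that the null gene sets favor long genes in the same way chance mutations would. The lengths below are simulated, `sample_by_length` is a hypothetical helper, and weighting by length is only one assumption among several possible ones (matching on length bins would be another).

```r
# Toy data (hypothetical): panel genes with made-up coding lengths in bp.
set.seed(7)
panel <- paste0("gene", 1:1100)
gene_length <- setNames(round(rlnorm(1100, meanlog = 7.5, sdlog = 0.8)), panel)

# Draw a random gene set in which longer genes are more likely to be picked,
# mimicking the fact that longer genes accumulate more mutations by chance.
sample_by_length <- function(panel, gene_length, size) {
  sample(panel, size, replace = FALSE, prob = gene_length[panel])
}

uniform_draws  <- replicate(200, mean(gene_length[sample(panel, 150)]))
weighted_draws <- replicate(
  200, mean(gene_length[sample_by_length(panel, gene_length, 150)])
)

# Length-weighted draws contain longer genes on average than uniform draws.
print(c(uniform = mean(uniform_draws), weighted = mean(weighted_draws)))
```

If the observed 150-gene list is itself biased toward long genes, a uniform null would tend to overstate enrichment for GO terms that happen to contain long genes.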

ADD REPLY • link 8.4 years ago by russhh 5.8k

0

Thanks for your response. Would you be able to clarify a bit? At this point the 5%-and-greater list isn't corrected for size, or at least the genes in it are not known to be significantly mutated compared to sequencing error or natural mutational frequency. We did do an analysis with MutSig, which does correct for this; however, we only have a hundred or so samples, so the power of the analysis is, as I understand it, relatively low, and we only reported 4-5 significantly mutated genes, e.g., typical drivers like KRAS, TP53, etc.

ADD REPLY • link 8.4 years ago by mforde84 ★ 1.4k

Ram · Answer 1 · 2016-12-01

1

8.4 years ago

LLTommy ★ 1.2k

Ok, when you say "clustering into specific functional groupings", would you be happy with "biological_process", "cellular_component" and "molecular_function", or should it be more specific, as in the example above? I could imagine that somebody put in manual work to create the groups in the picture you posted, and I don't know how or if you could do that automatically (at least not in a quick way).

However, if you just want a quick hacky solution to produce a graph that you can show your boss (and are happy with the general grouping described above), I might be able to help you rather quickly, even though this might not do exactly what you are looking for. So post some of those 150 GO terms from different groups and I'll give it a try.

ADD COMMENT • link updated 8.4 years ago by Ram 45k • written 8.4 years ago by LLTommy ★ 1.2k

0

Hi Tommy,

Thanks for offering help. That's very kind and generous of you. Unfortunately, the actual dataset is not mine to disseminate. I don't mean any offense by this, as I think you only have the best intentions in mind, and I truly apologize if this comes off as rude. It's just that the people I'm working with on this are very protective of their data, so I can't really take liberties with it.

I've been trying to manually trim the cytoscape networks and I really wish there was some programmatic way to remove nodes and edges from a network diagram. For example, if you look at ClueGO it has options to report terms for all GO levels (e.g., from root to most detailed level of annotation). So I can put in a list of genes and easily generate a network for everything, however the issue at hand with this approach is that there are over 1000 nodes to sift through and many more edges. Is there a way that I can script a filter instead?

Also, by clustering, I was talking about some generalizable annotations we have for a list of significant GO terms. For instance, we might have 10-20 significant GO terms which in some way, shape or form are associated with TGFbeta signaling / production. So it makes sense for us to group these terms together in the network. We have 5 or 6 groups at the moment; I'm just trying to figure out the best way to determine edges between these terms. I'm thinking it might make sense to look at % of overlapping genes?
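A minimal sketch of both ideas in base R, on made-up toy data: edges scored by the fraction of overlapping genes (here the Jaccard index, which is one of several possible overlap measures), followed by a scripted filter that drops weak edges and the nodes left without any edge. The term names, gene names, and the 0.25 cutoff are all hypothetical.

```r
# Toy data (hypothetical): significant GO terms and their panel genes.
go2gene <- list(
  "GO:tgfb_1" = c("g1", "g2", "g3", "g4"),
  "GO:tgfb_2" = c("g2", "g3", "g4", "g5"),
  "GO:repair" = c("g6", "g7", "g8"),
  "GO:misc"   = c("g4", "g8", "g9", "g10")
)

# Jaccard index: shared genes divided by genes in either term.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# All unordered pairs of terms, one row per candidate edge.
pairs <- t(combn(names(go2gene), 2))
edges <- data.frame(
  source  = pairs[, 1],
  target  = pairs[, 2],
  overlap = mapply(function(a, b) jaccard(go2gene[[a]], go2gene[[b]]),
                   pairs[, 1], pairs[, 2], USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Scripted filter: keep only edges above a chosen (arbitrary) cutoff, then
# keep only the nodes that still have at least one edge.
min_overlap <- 0.25
kept_edges <- edges[edges$overlap >= min_overlap, ]
kept_nodes <- union(kept_edges$source, kept_edges$target)
print(kept_edges)
```

The same filtering step works on any edge table, so a large network could be trimmed this way before it is loaded into a viewer, instead of deleting nodes by hand.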

ADD REPLY • link 8.4 years ago by mforde84 ★ 1.4k

0

No offense taken. I figured that you might want to protect your data; that is why I said post some GO terms, not all of them, and I might show you a quick hacky solution. What I have in mind might not be what you want anyway. Plus, I don't understand why you have 1000 nodes and even more edges; I thought it was 150? And in addition, you already have some different options to explore.

ADD REPLY • link 8.4 years ago by LLTommy ★ 1.2k

score 1 · Answer 2 · 2016-12-01

1

8.4 years ago

TriS ★ 4.7k

the two tests look at two slightly different things: the hypergeometric test gives the probability of picking items without replacement, while bootstrapping defines your confidence intervals by sampling with replacement. increasing the # of bootstraps will increase the precision of your empirical p-value but should maintain the initial distribution. in the past I used both bootstrapping and the hypergeometric test, and the results were comparable. however, I focused on gene expression instead of mutations, which could change things, since that is continuous rather than discrete data. so, I think that the approach you took is correct.

for the graph I agree with what was previously said: manually drawing it could be your best bet. if you want to be very precise about the hierarchy of your ontology, you can go to http://www.geneontology.org/ and check how the different terms are interconnected, then build a graph based on that.

ADD COMMENT • link 8.4 years ago by TriS ★ 4.7k

0

Thanks for clarifying some of my concerns. It's good to confirm that my approach has some basis in reality. One thing I should note is that the sampling I performed is without replacement, which is the default setting of R's sample():

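# Retrieve the resampled ratio distribution for one GO term.
# Needs the org.Hs.eg.db annotation package; allgene holds Entrez gene IDs.
library(org.Hs.eg.db)

goboot <- function(goid, allgene, sampleSize, nboot = 1000) {
    ratio <- numeric(nboot)
    # Entrez genes annotated to the term or to any of its child terms
    associated_genes <- unique(get(goid, org.Hs.egGO2ALLEGS))
    allgeneInCategory <- intersect(associated_genes, allgene)
    for (i in seq_len(nboot)) {
        # sample() draws without replacement by default
        gene.sample <- sample(allgene, sampleSize)
        k <- sum(gene.sample %in% allgeneInCategory)
        ratio[i] <- k / sampleSize
    }
    return(ratio)
}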

So each iteration has a unique set of ~150 genes. However, there can and will be overlap across iterations. Is this what you're referring to as replacement?
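A side note on what this loop computes: because each draw is made without replacement from a fixed gene list, the count k in every iteration follows a hypergeometric distribution, so the resampling is in effect a Monte Carlo approximation of the hypergeometric null. A minimal check with made-up numbers (the panel, the 60-gene category, and the threshold of 15 are all hypothetical):

```r
# Toy numbers (hypothetical): 1100 panel genes, 60 annotated to one GO term.
set.seed(123)
panel <- paste0("gene", 1:1100)
in_category <- panel[1:60]
sample_size <- 150
nboot <- 5000

# Same resampling as in goboot(), but keeping the raw counts.
k_sim <- replicate(nboot, sum(sample(panel, sample_size) %in% in_category))

# Exact mean and variance of the matching hypergeometric distribution.
N <- length(panel)
m <- length(in_category)
exact_mean <- sample_size * m / N
exact_var  <- sample_size * (m / N) * (1 - m / N) * (N - sample_size) / (N - 1)

print(c(sim_mean = mean(k_sim), exact_mean = exact_mean))
print(c(sim_var = var(k_sim), exact_var = exact_var))

# Tail probability P(X >= 15): simulated versus exact.
print(c(simulated = mean(k_sim >= 15),
        exact = phyper(14, m, N - m, sample_size, lower.tail = FALSE)))
```

Seen this way, any gain over the hypergeometric test comes from the normal approximation applied to the ratios, not from the resampling itself.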

ADD REPLY • link 8.4 years ago by mforde84 ★ 1.4k

0

by replacement I mean that within a single call to sample() the same gene can be picked more than once. this can be done by using gene.sample <- sample(allgene, sampleSize, replace=TRUE)

ADD REPLY • link 8.4 years ago by TriS ★ 4.7k

score 1 · Answer 3 · 2016-12-01

1

8.4 years ago

Jean-Karim Heriche 27k

Create a graph where each node is a GO term and weight the edges by the number of genes they share. Load into Cytoscape and play with it to get what you like. Use node attributes as explained in this post to vary node size.
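A minimal base R sketch of that construction, on made-up toy data: a term-by-gene incidence matrix whose cross-product counts shared genes, turned into an edge table plus a node table, both written as tab-delimited text that Cytoscape can import. The GO IDs, gene names, column names, and file names are all hypothetical.

```r
# Toy data (hypothetical): GO terms mapped to the genes that hit them.
go2gene <- list(
  "GO:0000001" = c("g1", "g2", "g3"),
  "GO:0000002" = c("g2", "g3", "g4", "g5"),
  "GO:0000003" = c("g5", "g6")
)

# Term-by-gene incidence matrix; its cross-product counts shared genes.
genes <- unique(unlist(go2gene))
incidence <- t(vapply(go2gene, function(g) as.integer(genes %in% g),
                      integer(length(genes))))
shared <- incidence %*% t(incidence)

# Upper triangle -> edge list, dropping pairs with no shared genes.
idx <- which(upper.tri(shared) & shared > 0, arr.ind = TRUE)
edges <- data.frame(source = rownames(shared)[idx[, 1]],
                    target = colnames(shared)[idx[, 2]],
                    shared_genes = shared[idx],
                    stringsAsFactors = FALSE)

# Node table: one attribute (gene count) that could drive node size.
nodes <- data.frame(id = names(go2gene),
                    n_genes = unname(lengths(go2gene)),
                    stringsAsFactors = FALSE)

# Tab-delimited files for import; the paths here are just temporary files.
edge_file <- file.path(tempdir(), "go_edges.tsv")
node_file <- file.path(tempdir(), "go_nodes.tsv")
write.table(edges, edge_file, sep = "\t", quote = FALSE, row.names = FALSE)
write.table(nodes, node_file, sep = "\t", quote = FALSE, row.names = FALSE)
print(edges)
```

Raw shared-gene counts favor large terms, so dividing by the size of the smaller term or of the union may give more comparable edge weights.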

ADD COMMENT • link 8.4 years ago by Jean-Karim Heriche 27k

0

Thanks for the info. By any chance, do you know of any documentation that addresses assigning multiple colors to a node?

ADD REPLY • link 8.4 years ago by mforde84 ★ 1.4k

0

You can assign two colors to a node by giving one color to the border and one to the center using the same approach. Alternatively, there are the MultiColoredNodes or the enhancedGraphics plugins but I haven't tried them.

ADD REPLY • link 8.4 years ago by Jean-Karim Heriche 27k