Differential analysis of enrichment results
8 weeks ago
Aspire ▴ 350

I have expression data across multiple time points. I've clustered the data using kmeans (after per-gene standardization). There are two kmeans clusters which seem interesting, and I wish to compare them.

I thought of performing enrichment analysis for each of the clusters, and then seeing which of the enriched sets identify each of the clusters uniquely.

Is there a correct, well-established way of doing this?

Are there pitfalls to look for?

For example, if a specific set (e.g. a biological pathway) has an adjusted p-value of 0.04 in one of the clusters but 0.06 in the other, then it makes no sense to consider this set unique to only one of the clusters.

Clarification:

This is time-course data, so there is no need to cluster by samples, as they are ordered by time. I'd rather not post my own data, but I have found a published image which demonstrates this:

[Figure: published example of gene clusters from time-course data, including clusters III and IV; image and source link not preserved]

Suppose one wanted to know which molecular functions identify cluster III, versus those that identify cluster IV. How should that be done?

statistics enrichment • 668 views
8 weeks ago

This is an interesting question, and not one I am aware of there being an answer to.

What I will say is that you should probably not just compare the output of fgsea or clusterProfiler on the two clusters and declare that any term enriched in cluster 1 but not in cluster 2 differs between them. A failure to find a significant enrichment doesn't mean there isn't an enrichment, just that the evidence for it does not meet some arbitrary threshold for significance. If "WNT pathway" is enriched with an odds ratio of 2.4 and an adjusted p-value of 0.049 in cluster 1, and with an odds ratio of 2.3 and an adjusted p-value of 0.051 in cluster 2, is it really reasonable to say that cluster 1 is associated with the WNT pathway and cluster 2 isn't? Similarly, if "Kinase activity" is enriched with an odds ratio of 10 in cluster 1 (p-value < 10e-16) and an odds ratio of 1.2 in cluster 2 (p-value = 0.04), is it really fair to say the enrichment is the same in both?
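To make that concrete, here is a minimal Python sketch (standard library only) that reports an odds ratio with an approximate 95% confidence interval for one gene set in each cluster, instead of a significant/not-significant call. The counts are made up, the function name `odds_ratio_ci` is hypothetical, and the interval uses the Woolf (log) approximation with a 0.5 correction. That choice is my assumption, not something fgsea or clusterProfiler reports.

```python
from math import exp, log, sqrt

def odds_ratio_ci(in_set_cluster, not_set_cluster, in_set_bg, not_set_bg, z=1.96):
    """Odds ratio and approximate 95% CI (Woolf method, +0.5 correction)."""
    a, b, c, d = (x + 0.5 for x in (in_set_cluster, not_set_cluster, in_set_bg, not_set_bg))
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or), exp(log_or - z * se), exp(log_or + z * se)

# Hypothetical counts: (genes in the set, genes not in the set).
background = (200, 9800)  # shared background, assumed identical for both clusters
clusters = {"cluster 1": (24, 476), "cluster 2": (23, 477)}

for name, (in_set, not_set) in clusters.items():
    or_, lo, hi = odds_ratio_ci(in_set, not_set, *background)
    print(f"{name}: OR = {or_:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

With counts this similar the two intervals overlap almost entirely, whichever side of 0.05 each adjusted p-value happens to land on.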

However, I don't know of any package for statistically comparing enrichments between clusters/conditions. One idea would be to test enrichment using a logistic regression, and then include the two conditions as an interaction factor in the model.
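A simpler relative of that idea, sketched in standard-library Python under stated assumptions: if both clusters are tested against the same background, the ratio of the two enrichment odds ratios equals the odds ratio of the 2x2 table "cluster 1 vs cluster 2" by "in set vs not in set", so the two clusters can be compared head-to-head. The counts are invented, `fisher_two_sided` is a hypothetical helper (not from any enrichment package), and the resulting p-values would still need multiple-testing correction across all the sets tested.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in-set genes falling in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Hypothetical counts: (genes in the set, genes not in the set) per cluster.
cluster_1 = (40, 460)
cluster_2 = (12, 488)

odds_ratio = (cluster_1[0] * cluster_2[1]) / (cluster_1[1] * cluster_2[0])
p = fisher_two_sided(*cluster_1, *cluster_2)
print(f"cluster 1 vs cluster 2: OR = {odds_ratio:.2f}, p = {p:.3g}")
```

This tests the difference between the clusters directly, so a set only counts as distinguishing them when the two enrichments themselves differ, not when one p-value crosses a threshold and the other narrowly misses it.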


What I intend to do is to look at the enrichments of the two gene lists (one per cluster) in a network view (with ShinyGO or a similar tool). An example screenshot was attached here.

The network view is not a rigorous statistical framework for comparing enrichments. However, IMHO, it can give a good bird's-eye overview of the differences between the two clusters. In this way, I hope that if GO term X has a p-value of 0.049 in cluster 1 but 0.051 in cluster 2, then other terms similar to GO term X would still be significant in cluster 2. I think that if I get a strong network of interrelated terms in cluster 1, and this specific interrelated network is missing in cluster 2, then that is reasonable evidence for a difference in enrichments.

How does that sound?
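The intuition that related terms back each other up can be made a little more explicit. The sketch below (standard-library Python) links two enriched terms when their gene sets overlap, using the Jaccard index, and then lists the connected groups of terms. The term names, gene lists and the 0.2 cutoff are invented for illustration, and this is not a description of how ShinyGO builds its network.

```python
from itertools import combinations

def jaccard(genes_a, genes_b):
    """Shared genes divided by the union of both gene sets."""
    union = genes_a | genes_b
    return len(genes_a & genes_b) / len(union) if union else 0.0

def term_modules(term_genes, cutoff=0.2):
    """Group terms into connected components of the overlap network."""
    neighbours = {term: set() for term in term_genes}
    for t1, t2 in combinations(term_genes, 2):
        if jaccard(term_genes[t1], term_genes[t2]) >= cutoff:
            neighbours[t1].add(t2)
            neighbours[t2].add(t1)

    modules, seen = [], set()
    for term in term_genes:
        if term in seen:
            continue
        stack, module = [term], set()
        while stack:  # depth-first walk over linked terms
            current = stack.pop()
            if current not in module:
                module.add(current)
                stack.extend(neighbours[current] - module)
        seen |= module
        modules.append(module)
    return modules

# Hypothetical enriched terms for one cluster, with the cluster genes hitting each term.
cluster_1_terms = {
    "Wnt signalling": {"G1", "G2", "G3", "G4"},
    "beta-catenin binding": {"G2", "G3", "G4", "G5"},
    "cell fate commitment": {"G3", "G4", "G5", "G6"},
    "kinase activity": {"G7", "G8", "G9"},
}

for module in term_modules(cluster_1_terms):
    print(sorted(module))
```

Running the same grouping on each cluster's enriched terms and comparing which modules appear in one but not the other is still descriptive, not a statistical test, but it is less sensitive to a single term sitting just either side of the cutoff.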

8 weeks ago

If your expression data is formatted as gene counts (whole numbers) from an RNA-seq experiment, you should consider running something like DESeq2 to get differential expression, and then you can run GSEA on the differential expression results using something like clusterProfiler or fgsea. The vignettes for these packages are very detailed and will be able to guide you, but I'm happy to answer any questions too.

It is also important to note that any clustering algorithm will always provide at least 2 clusters. You should visualise your data as @BioinfGuru said and overlay some known variables like batches/timepoints etc to see if you are getting technical variation.


DESeq2 is technically not possible, as the genes in two of the clusters (clustered via kmeans) are mutually exclusive.


"I've clustered the data" usually means you included all samples (with all genes) to see which samples cluster together based on their gene expression profiles. Each cluster then contains a group of samples (with all genes).

You may want to edit the original question explaining exactly what you clustered, how, and why.


Thanks, edited


You should be clustering samples and not genes. I'm not sure I can see a sound rationale for what you have described.

8 weeks ago
BioinfGuru ★ 2.1k

That clarification helped a lot. Thank you. I was way off.

DESeq2 will still do the job:

Using an interaction term in the design formula, you can also extract how much the changes in gene expression over time are affected by the condition (WT or Daphne). The DEGs of each cluster can then be passed to GSEA, giving you terms for each cluster. If you have access to Cytoscape or some other network analysis tool, the DEGs may even give a nice network coloured by cluster.

Just some thoughts to hopefully give some direction.
