Question

Differential analysis of enrichment results

5 months ago

Aspire ▴ 370

I have expression data across multiple time points. I've clustered the data using kmeans (after per-gene standardization). There are two kmeans clusters which seem interesting, and I wish to compare them.

I thought of performing enrichment analysis for each of the clusters, and then seeing which of the enriched sets identify each of the clusters uniquely.

Is there a correct, well-established way of doing this?

Are there pitfalls to look for?

For example, if a specific set (e.g. a biological pathway) has an adjusted p-value of 0.04 in one of the clusters but 0.06 in another, then it makes no sense to consider this set unique to only one of the clusters.

Clarification:

This is time-course data, so there is no need to cluster by samples, as they are ordered by time. I'd rather not post my own data, but I have found a published image which demonstrates this:

[Image: published figure showing the gene clusters referred to below; image and source link not preserved]

Suppose one wanted to know which molecular functions identify cluster III, versus the ones that identify cluster IV. How should that be done?

statistics enrichment • 784 views

score 1 · Answer 1 · 2024-07-09

This is an interesting question, and not one I am aware of there being an answer to.

What I will say is that you should probably not just compare the output of fgsea or clusterProfiler on two clusters and say that any term enriched in cluster 1 but not in cluster 2 is different between the two clusters. A failure to find a significant enrichment doesn't mean there isn't an enrichment, just that the evidence for it does not meet some arbitrary threshold for significance. If "WNT pathway" is enriched with an odds ratio of 2.4 and an adjusted p-value of 0.049 in cluster 1, and an odds ratio of 2.3 and an adjusted p-value of 0.051 in cluster 2, is it really reasonable to say that cluster 1 is associated with the WNT pathway and cluster 2 isn't? Similarly, if "Kinase activity" is enriched with an odds ratio of 10 in cluster 1 (p-value < 10e-16) and an odds ratio of 1.2 (p-value = 0.04) in cluster 2, is it really fair to say it is the same?

However, I don't know of any package for statistically comparing enrichments between clusters/conditions. One idea would be to test enrichment using a logistic regression, and then to include the two conditions as an interaction factor in the model.
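To make the idea of comparing the clusters directly a bit more concrete, here is a minimal sketch in plain Python. It is not a full logistic regression with an interaction term. It relies on the fact that, if gene-set membership is modelled against a categorical cluster factor, the cluster 1 vs cluster 2 contrast reduces to the log odds ratio of a 2x2 table (cluster by in-set/not-in-set), which can be tested with a Wald test. The function name, the gene IDs and the 0.5 correction for empty cells are all assumptions made for illustration, and it assumes the two clusters share no genes, as kmeans clusters do.

```python
import math

def compare_set_between_clusters(cluster1, cluster2, gene_set):
    """Wald test on the log odds ratio of gene-set membership,
    cluster 1 vs cluster 2 (hypothetical helper, not from any package)."""
    c1, c2, gs = set(cluster1), set(cluster2), set(gene_set)
    if c1 & c2:
        raise ValueError("clusters must be mutually exclusive")

    a = len(c1 & gs)   # cluster 1 genes in the set
    b = len(c1) - a    # cluster 1 genes not in the set
    c = len(c2 & gs)   # cluster 2 genes in the set
    d = len(c2) - c    # cluster 2 genes not in the set

    cells = [a, b, c, d]
    if 0 in cells:
        # Assumed choice: add 0.5 to every cell so the log odds ratio is finite.
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells

    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = log_or / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"odds_ratio": math.exp(log_or), "log_or": log_or,
            "se": se, "z": z, "p_value": p_value}

# Made-up example: 100 genes per cluster; the pathway contains
# 30 genes from cluster 1 and 10 genes from cluster 2.
cluster_1 = [f"g{i}" for i in range(100)]
cluster_2 = [f"g{i}" for i in range(100, 200)]
pathway = [f"g{i}" for i in range(30)] + [f"g{i}" for i in range(100, 110)]

result = compare_set_between_clusters(cluster_1, cluster_2, pathway)
print(result)
```

This asks "is the set more common in cluster 1 than in cluster 2?" in a single test, rather than comparing two separate p-values against a threshold. Run over many gene sets, the p-values would still need multiple-testing correction.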

score 0 · Answer 2 · 2024-07-08

5 months ago

yura.grabovska ▴ 690

If your expression data is formatted as gene counts (whole numbers) from an RNA-seq experiment, you should consider running something like DESeq2 to get differential expression, and then you can run GSEA on the differential expression results using something like clusterProfiler or fgsea. The vignettes for these packages are very detailed and will be able to guide you, but I'm happy to answer any questions too.

It is also important to note that any clustering algorithm will always provide at least 2 clusters. You should visualise your data as @BioinfGuru said and overlay some known variables, like batches or timepoints, to see whether the clusters reflect technical variation.

DESeq2 is technically not possible, as the genes in two of the clusters (clustered via kmeans) are mutually exclusive.

5 months ago by Aspire ▴ 370

"I've clustered the data" usually means you included all samples (with all genes) to see which samples cluster together based on their gene expression profiles. Each cluster then contains a group of samples (with all genes).

You may want to edit the original question explaining exactly what you clustered, how, and why.

5 months ago by BioinfGuru ★ 2.1k

Thanks, edited

5 months ago by Aspire ▴ 370

You should be clustering samples and not genes. I'm not sure I can see a sound rationale for what you have described.

5 months ago by yura.grabovska ▴ 690

score 0 · Answer 3 · 2024-07-08

That clarification helped a lot. Thank you. I was way off.

DESeq2 will still do the job:

Using an interaction term in the design formula, you can also extract how much the changes in gene expression over time are affected by the condition (WT or Daphne). The DEGs of each cluster can then be passed to GSEA, giving you terms for each cluster. If you have access to Cytoscape or some other network analysis tool, you may even be able to draw a nice network of the DEGs coloured by cluster.

Just some thoughts to hopefully give some direction.