I'm analysing an RNA-Seq dataset and want to do preranked GSEA to find out what pathways are changing in my dataset.
I've read that this type of analysis should be done using ALL the genes detected in the experiment, not just the DE genes. This makes sense to me for relatively simple experiments (i.e. condition A vs condition B). However, my experiment involved five conditions, so I performed K-means clustering to pick out expression patterns and have chosen some interesting clusters to compare with one another.
Some clusters are paired because they are upregulated/downregulated in condition A vs B, but are clustered separately from the rest of the genes due to their expression patterns in C, D and E. Just for clarity, A/B is control versus stimulus and C/D/E are drug conditions. I'm interested in genes which are affected by the stimulus but grouped by their sensitivity to the drugs. Thus, I end up with paired clusters such as up+drug-insensitive/down+drug-insensitive or up+drug-sensitive/down+drug-sensitive, etc.
I thought I would run these clusters in a pre-ranked GSEA, with the ranking statistic being the log2FC (or some other statistic) of the control vs stimulated comparison. However, I'm a little concerned that I'm violating the statistical assumptions of GSEA by restricting the ranked list to the genes in the selected clusters rather than using all detected genes.
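To make the setup concrete, here is a minimal sketch of the two ranked lists I'm comparing: one built from all detected genes, and one restricted to a pair of clusters. The gene names, log2FC values, cluster labels, and the helper names `ranked_list` and `to_rnk` are all made up for illustration; the only real convention assumed is the tab-separated gene/score layout of a GSEA `.rnk` file.

```python
# Toy data (hypothetical): gene -> log2FC for control (A) vs stimulus (B).
log2fc = {
    "GeneA": 2.1, "GeneB": -1.7, "GeneC": 0.3,
    "GeneD": 1.2, "GeneE": -0.4, "GeneF": -2.5,
}

# Hypothetical K-means cluster labels; here clusters 1 and 2 stand in for
# an "up" / "down" pair that shares the same drug sensitivity.
cluster = {
    "GeneA": 1, "GeneB": 2, "GeneC": 3,
    "GeneD": 1, "GeneE": 3, "GeneF": 2,
}


def ranked_list(scores, keep=None):
    """Return (gene, score) pairs sorted by descending score.

    If `keep` is given, only genes in that set are retained.
    """
    items = [(g, s) for g, s in scores.items() if keep is None or g in keep]
    return sorted(items, key=lambda pair: pair[1], reverse=True)


def to_rnk(ranked):
    """Format a ranked list as tab-separated 'gene<TAB>score' lines."""
    return "".join(f"{gene}\t{score}\n" for gene, score in ranked)


# Ranked list over every detected gene (the usually recommended input).
full_rank = ranked_list(log2fc)

# Ranked list restricted to the paired clusters (the filtering in question).
paired_genes = {g for g, c in cluster.items() if c in (1, 2)}
subset_rank = ranked_list(log2fc, keep=paired_genes)

print(to_rnk(full_rank))
print(to_rnk(subset_rank))
```

Note that in the restricted list the genes with near-zero log2FC drop out entirely, so the middle of the ranking disappears; that is the part I'm unsure about.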
Any thoughts?
Which filtering are you referring to?
If you mean clustering, then that is comparable to what you would do if you were to sort the populations for bulk RNA-seq, which is a common approach.