I'm trying to run GSEA on a ranked list of genes. In other words, I'm not using expression data (instead, I'm using a list of genes ranked by the prevalence of variants in those genes in my dataset). I can't figure out how to run GSEA using non-standard input files - either the desktop version or the R version. Each tutorial I can find details how to run GSEA on expression files that contain expression levels from each individual subject, while I already have a list of genes I'm interested in.
You could just pretend that the rankings are expression levels (you may have to reverse the ordering so that the most prevalently affected gene has the highest number). One of the first calls in any GSEA function is rank(), after all. If that doesn't seem to be working well for you, let me know and I can post some R code.
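As a rough sketch of the reversal, assuming the input is simply a list of gene names ordered from most to least prevalently affected (the gene names and the variable names ranked_genes and pseudo_expr are made up for illustration):

```r
# Hypothetical input: genes ordered from most to least prevalently affected
ranked_genes <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")

# Position 1 is the most affected gene, so reverse the positions:
# the top gene gets the highest pseudo-expression value
pseudo_expr <- rev(seq_along(ranked_genes))
names(pseudo_expr) <- ranked_genes

pseudo_expr
```

The resulting named vector can then stand in for a single column of expression values.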
Thanks for the help. For some reason, it isn't working correctly. I'm assuming this is my issue, as my coding background is rather weak. I'll keep trying...
You can also try directly doing a ks.test() as lkmklsmn mentioned.
BTW, regardless of the test you end up using, do have a look at the results yourself. Tests like this that compare distributions are known to flag results that are statistically significant but likely biologically meaningless.
The GSEA algorithm is based on the Kolmogorov-Smirnov statistical test (it uses a weighted, KS-like running-sum statistic). This method tests for a shift in ranks between a set of interest and the background. You would basically be asking the question: is this particular set of genes enriched among the top genes in the ranked list of all genes?
This is fairly simple to do in R. The code would look like this (not run):
# scores: a numeric vector of your scores (prevalence of variants) for all genes in your dataset
ranking <- rank(scores)
# ind: a numeric vector containing the indices of your gene set in scores
geneset <- ranking[ind]
background <- ranking[-ind]
ks.test(geneset, background)
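To try the snippet above end to end, here is a minimal sketch on simulated data. Everything in it is made up for illustration: the 1000 gene scores, the 50-gene set, and the artificial +1 shift that makes the set enriched. Replace scores and ind with your own data.

```r
set.seed(1)  # reproducible toy data

# Hypothetical data: 1000 genes with simulated variant-prevalence scores
scores <- rexp(1000)
names(scores) <- paste0("gene", seq_along(scores))

# Hypothetical gene set: 50 random genes whose scores are shifted upwards
ind <- sample(seq_along(scores), 50)
scores[ind] <- scores[ind] + 1

ranking <- rank(scores)        # rank 1 = lowest score
geneset <- ranking[ind]        # ranks of the genes in the set
background <- ranking[-ind]    # ranks of all other genes

res <- ks.test(geneset, background)  # two-sided, two-sample KS test
res$statistic
res$p.value

# The two-sided test does not give the direction of the shift,
# so compare the typical ranks as well
median(geneset) > median(background)
```

Since the two-sided test only says that the distributions differ, checking the medians (or plotting both sets of ranks) shows whether the gene set sits towards the top or the bottom of the list.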