I am trying to do some exploratory bioinformatics on TCGA data using fgsea.
Our lab looks at a specific gene, so I was trying to see whether high levels of this gene in TCGA expression data are correlated with enrichment of any gene sets. I have been pre-ranking the data using DESeq2 (using the F stat as the ranking metric) and was wondering how I should set up the design.
Because it is a continuous variable, I could plug the scaled normalised counts for this gene straight into the DESeq2 design, or I could split the samples into low/high expression groups and then run DESeq2 to estimate the difference between low and high.
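To make the two options concrete, here is a minimal sketch of how the two covariates could be derived from the same expression values. The sample names and numbers are invented for illustration, and `continuous_covariate` and `median_split` are hypothetical helper names, not part of DESeq2 or fgsea. Either resulting column would then be supplied as the variable in the design formula.

```python
from statistics import mean, median, stdev

# Hypothetical normalised expression of the gene of interest, one value per sample.
expression = {"s1": 5.1, "s2": 6.3, "s3": 7.0, "s4": 7.8, "s5": 8.4, "s6": 9.9}


def continuous_covariate(values):
    """Centre and scale the values to mean 0 and standard deviation 1."""
    mu = mean(values.values())
    sd = stdev(values.values())
    return {sample: (v - mu) / sd for sample, v in values.items()}


def median_split(values):
    """Label each sample 'high' or 'low' relative to the median."""
    cutoff = median(values.values())
    return {sample: ("high" if v > cutoff else "low") for sample, v in values.items()}


scaled = continuous_covariate(expression)  # option 1: continuous design variable
groups = median_split(expression)          # option 2: two-level factor

for sample in expression:
    print(sample, round(scaled[sample], 2), groups[sample])
```

Note what the split discards: s3 (7.0) and s4 (7.8) land in different groups, while s1 (5.1) and s3 (7.0) are treated as identical. The continuous covariate keeps those distances.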
I was wondering which of these (if either) is more acceptable? I assume using the continuous variable makes the most sense, but I have only seen other bioinformaticians do it by splitting the expression into two groups. Is the Wald test in DESeq2 the most appropriate tool for this?
I have run both methods using the hallmark gene sets and see very different rankings and similar but slightly different ES results. What are people's thoughts?
I guess it depends a bit on the range of expression of that gene across the samples. If you use it as a continuous variable and it is poorly expressed in some samples but "off-the-chart/super high" in others, wouldn't a stratification make more sense then, maybe low-middle-high?
Ah, yes, agreed. I would recommend looking at the distribution of expression of that gene and seeing whether distinct clusters exist.
Let's say there are six samples. In an extreme case for your gene of interest, three samples may have expression values 0.01, 0.02, 0.03 while three other samples may have expression values 100, 100.01, 100.02. The minor within-cluster differences of 0.01 aren't meaningful and might screw up a continuous variable analysis (especially something like a pairwise Spearman correlation). I would definitely recommend a stratification in this particular case.
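A small sketch of why the within-cluster noise matters for a rank-based measure. The six values for the gene of interest are the ones from the example above; `target_a` and `target_b` are two invented genes that are both low in the first cluster and high in the second, and differ only in their within-cluster ordering. The Spearman formula used here assumes there are no ties, which holds for this toy data.

```python
from statistics import mean


def ranks(values):
    """Ranks from 1..n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def high_minus_low(y, n_low=3):
    """Difference in mean between the high cluster and the low cluster."""
    return mean(y[n_low:]) - mean(y[:n_low])


gene_of_interest = [0.01, 0.02, 0.03, 100.0, 100.01, 100.02]

# Hypothetical target genes: same between-cluster shift, opposite within-cluster order.
target_a = [5.0, 5.1, 5.2, 50.0, 50.1, 50.2]
target_b = [5.2, 5.1, 5.0, 50.2, 50.1, 50.0]

rho_a = spearman(gene_of_interest, target_a)  # 1.0
rho_b = spearman(gene_of_interest, target_b)  # about 0.54

print(rho_a, round(rho_b, 3))
print(high_minus_low(target_a), high_minus_low(target_b))
```

Both target genes show the same high-versus-low difference, yet the rank correlation drops from 1.0 to about 0.54 purely because of how the 0.01-sized differences happen to be ordered.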
On the other hand, if the expression values for your gene of interest are 10, 20, 30, 40, 50, 60, a stratification might not be such a great idea.
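One crude way to put a number on the difference between these two toy cases is to look at the largest gap between neighbouring sorted values, relative to the full range. This is only an illustrative heuristic with a made-up function name (`largest_gap_fraction`), not an established test. In practice a histogram or density plot of the gene's expression is the more usual first look.

```python
def largest_gap_fraction(values):
    """Largest gap between neighbouring sorted values, as a fraction of the full range."""
    s = sorted(values)
    total = s[-1] - s[0]
    if total == 0:
        return 0.0  # all samples identical, so there is no gap to speak of
    return max(b - a for a, b in zip(s, s[1:])) / total


clustered = [0.01, 0.02, 0.03, 100.0, 100.01, 100.02]
evenly_spread = [10, 20, 30, 40, 50, 60]

print(round(largest_gap_fraction(clustered), 4))      # close to 1: one gap dominates
print(round(largest_gap_fraction(evenly_spread), 4))  # 0.2, i.e. 1/(n-1) for n = 6
```

A value near 1 means a single jump accounts for almost all of the spread, which is the situation where stratifying is easiest to justify. A value near 1/(n-1) means the samples are spread evenly and any cutoff would be arbitrary.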
In my experience, continuous variables generally work well, but there's no one-size-fits-all answer. A similar issue often comes up in survival analysis: do you run a Cox regression with gene expression as a continuous variable, or separate patients into high and low groups and show the two survival curves?
That's a good point, thank you both. The expression looks pretty normally distributed, so I think I am going to proceed with a continuous design and see where it takes me.
Thanks a lot! I guess I don't mean the "most" but the "more" acceptable method (edited to change it). I only ask because I started using R just a month ago and was worried that what I was doing is not accepted in the bioinformatics community or something. It's reassuring to hear that I am thinking along the right lines!