fgsea with preranked data based on gene expression using DESeq2
4.5 years ago
jack.henry ▴ 50

I am trying to do some exploratory bioinformatics on TCGA data using fgsea.

Our lab looks at a specific gene, so I was trying to see whether high levels of this gene in TCGA expression data are correlated with enrichment of any gene sets. I have been preranking the data using DESeq2 (and using the F stat as a ranking) and was wondering how I should set up the design.

Because it is a continuous variable, I could plug the scaled normalised counts for this gene straight into the DESeq2 design, or I could split the expression into low/high groups and then run DESeq2 to calculate the difference between low and high.
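To make the two options concrete, here is a minimal sketch in plain Python (not DESeq2 code) of how each covariate could be built from one gene's normalised counts. The sample values, the median cut-off, and the function names are illustrative assumptions, not something taken from the thread.

```python
from statistics import mean, median, stdev

def scale_expression(values):
    """Centre and scale to mean 0 and sd 1, giving a continuous covariate."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def split_low_high(values):
    """Label each sample 'low' or 'high' relative to the median (an assumed cut-off)."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]

# Hypothetical normalised counts for the gene of interest in six samples
counts = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]

continuous = scale_expression(counts)  # option 1: one number per sample
groups = split_low_high(counts)        # option 2: one label per sample
```

Either result would become a column of the sample table used in the design. The continuous version keeps the ordering and spacing of the samples, while the low/high version keeps only which side of the cut-off each sample falls on.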

I was wondering which of these (if either) is more acceptable. I assume using the continuous variable makes the most sense, but I have only seen it done by other bioinformaticians splitting the expression into two groups. Is the Wald test with DESeq2 the most appropriate tool for this?

I have run both methods using the hallmark gene sets and see very different rankings and similar but slightly different ES results. What are people's thoughts?

[Figure: NES from HALLMARK gene sets]

RNA-Seq DeSeq2 gsea fgsea R • 3.7k views
4.5 years ago
dsull ★ 7.0k

Personally, I think the low/high stratification isn't ideal because you lose information about the expression of your gene of interest (you're collapsing everything into two values: low or high). I prefer the continuous design (edit: however, please see discussion below; important caveats).

An alternative approach would be to calculate the pair-wise correlation between every gene and your gene of interest (using normalized count values); you can use the correlation coefficients as your ranking. Whether this is "better" than using the DESeq2 statistic, I don't know. There are many ways to analyze data, and the answer to what is "most acceptable" is not always clear or easy.
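A rough sketch of that correlation-based ranking, in plain Python with no external packages. The gene names and expression values are made up for illustration, and Pearson correlation is only one possible choice of coefficient.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rank_by_correlation(expr, target):
    """Return (gene, r) pairs sorted from most positive to most negative r."""
    ref = expr[target]
    scores = {g: pearson(ref, v) for g, v in expr.items() if g != target}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical normalised expression, one list of six samples per gene
expr = {
    "GENE_OF_INTEREST": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # rises with the gene of interest
    "GENE_B": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],  # falls as it rises
    "GENE_C": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],  # rises, with some noise
}

ranking = rank_by_correlation(expr, "GENE_OF_INTEREST")
```

The gene-to-coefficient pairs in `ranking` would then play the role of the preranked statistic for the enrichment step.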


I guess it depends a bit on the range of expression of that gene across the samples. If it is poorly expressed in some samples but "off-the-chart/super high" in others, wouldn't a stratification make more sense than a continuous variable, maybe low-middle-high?


Ah, yes, agreed, I would recommend looking at the distribution of expression of that gene and see if distinct clusters exist.

Let's say there are six samples. In an extreme case for your gene of interest, three samples may have expression values 0.01, 0.02, 0.03 while three other samples may have expression values 100, 100.01, 100.02. The minor within-cluster 0.01 differences aren't meaningful and might screw up a continuous variable analysis (especially something like a pair-wise Spearman correlation, which only looks at ranks). I would definitely recommend a stratification in this particular case.

On the other hand, if your expression values for your gene of interest are 10, 20, 30, 40, 50, 60 -- a stratification might not be such a great idea.
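The clustered six-sample case can be checked numerically. The sketch below is plain Python; the second gene's values are invented so that it is clearly higher in the high cluster but runs in the opposite order within each cluster.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n of the values (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = float(rank)
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Gene of interest: two tight clusters, as in the example above
clustered = [0.01, 0.02, 0.03, 100.0, 100.01, 100.02]
# Hypothetical second gene: low in the low cluster, high in the high cluster,
# but decreasing within each cluster
other = [3.0, 2.0, 1.0, 6.0, 5.0, 4.0]

rho = spearman(clustered, other)  # pulled down by the within-cluster ordering
r = pearson(clustered, other)     # dominated by the gap between clusters
```

Here `rho` comes out well below `r`, because the rank-based measure gives the 0.01 differences inside a cluster the same weight as the jump between clusters. With evenly spread values such as 10, 20, ..., 60 there is no such gap, so the ranks carry real information.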

In my experience continuous variables generally work well, but there's no one-size-fits-all. A similar issue often comes up with survival analysis: do a Cox regression with gene expression as a continuous variable, or separate patients into high and low groups and show the two survival curves?


That's a good point, thank you both. The expression looks pretty normally distributed, so I think I am going to proceed with a continuous design and see where it takes me.

[Figure: histogram of the gene's expression]


Thanks a lot! I guess I don't mean the "most" but the "more" acceptable method (edited to change it). I only ask because I started using R just a month ago and was worried that what I was doing was simply not accepted in the bioinformatics community or something. It's reassuring to hear that I am thinking along the right lines!
