I would like to hear your opinion about this approach.
I am doing over-representation analysis on differentially expressed genes from RNA-Seq.
Instead of running a single test (say, for up-regulated genes above a log2FC cut-off of x), I run a separate test for each cut-off interval. For example, one test for up-regulated genes with log2FC between 0 and 0.5, another for those with log2FC between 0.5 and 1, and so on. I then do the same separately for negative fold changes.
Afterwards I do not pool the results; I just compare the terms found enriched in the different cut-off intervals.
This way I can check whether genes that are significantly up- or down-regulated with similar intensity are enriched for specific terms/ontologies that might be missed when using a single cut-off value.
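The binning step described above could be sketched roughly as follows. This is only an illustration: the function name `bin_by_log2fc`, the gene IDs, and the choice of half-open intervals (lower edge included, upper edge excluded) are assumptions, not part of the original analysis.

```python
from bisect import bisect_right

def bin_by_log2fc(log2fc, edges):
    """Group gene IDs into half-open intervals [edges[i], edges[i+1])."""
    bins = {(lo, hi): [] for lo, hi in zip(edges, edges[1:])}
    for gene, value in log2fc.items():
        i = bisect_right(edges, value) - 1  # index of the interval's lower edge
        if 0 <= i < len(edges) - 1:         # values outside all intervals are skipped
            bins[(edges[i], edges[i + 1])].append(gene)
    return bins

# Hypothetical significant genes with their log2 fold changes
log2fc = {"geneA": 0.2, "geneB": 0.7, "geneC": 1.3, "geneD": -0.4, "geneE": -1.1}
edges = [0.0, 0.5, 1.0, 1.5]

# Up- and down-regulated genes are binned separately, as in the post;
# down-regulated genes are binned on the magnitude of their fold change.
up = bin_by_log2fc({g: v for g, v in log2fc.items() if v > 0}, edges)
down = bin_by_log2fc({g: -v for g, v in log2fc.items() if v < 0}, edges)
```

Each list in `up` and `down` would then go through its own enrichment test against the same background.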
Hope I was clear
Pietro
Hi Shawn, thanks for your answer.
I know what GSEA is but I am not sure it can answer my questions for this particular hypothesis.
My question was more like "Do genes that are significantly up regulated within a specific FC interval show enrichment for some categories/ontologies/terms, compared to genes significantly up regulated within different FC intervals?".
To make up an example with invented numbers, let's say I have 70 genes that are significantly up regulated within the FC interval 1 to 1.5. Of these, 50 genes belong to ontology A. Within my universe/background (~ 15000 genes), there are 90 genes belonging to ontology A. Gene ontology enrichment testing of these 70 genes gives a highly significant result for ontology A. Then I take all the significantly up regulated genes that have a FC > 1, which are, let's say, 1000 genes. Gene ontology enrichment testing using these 1000 genes results in no significance for ontology A.
I would like to understand if this makes sense from a biological perspective.
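The numbers in this example can be checked with a one-sided hypergeometric test, which is commonly used for over-representation analysis. The sketch below is illustrative only: the function name `enrichment_p_value` is hypothetical, and the number of ontology A genes in the 1000-gene list is not given in the post, so the code assumes that list contains the 70 genes from the 1 to 1.5 interval and therefore at least the same 50 ontology A genes.

```python
from fractions import Fraction
from math import comb

def enrichment_p_value(universe, in_term, selected, overlap):
    """One-sided hypergeometric P(X >= overlap), computed exactly."""
    total = comb(universe, selected)
    max_overlap = min(in_term, selected)
    tail = sum(
        comb(in_term, k) * comb(universe - in_term, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return float(Fraction(tail, total))

# Background of 15000 genes, 90 of them annotated to ontology A
p_interval = enrichment_p_value(15000, 90, 70, 50)    # 50 of 70 genes, FC 1 to 1.5
p_all_up = enrichment_p_value(15000, 90, 1000, 50)    # ASSUMED: 50 of 1000 genes, FC > 1

print(f"FC 1 to 1.5 (70 genes): p = {p_interval:.3g}")
print(f"FC > 1 (1000 genes):    p = {p_all_up:.3g}")
```

Under that assumption, 50 hits among 1000 genes is still well above the roughly 6 expected by chance (1000 × 90 / 15000), so the signal gets weaker in the larger list but does not disappear.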
So it sounds like you're concerned about losing true signal by looking at too large a group of genes. My concern would be false positives/negatives based on the arbitrary nature of the cutoffs. I'd still argue in favor of GSEA over binning the genes by log2FC cutoffs.
In your example case you have 50/90 genes from ontology A that are between 1 and 1.5-fold upregulated. If this were the case then both your modified GO term enrichment and GSEA would find significance. A better question would be if you're still interested in this result if the other 40 genes are between 1 and 1.5-fold downregulated. From the method you've described, an ontology where genes are both up and down would be found significant in both analyses, whereas with GSEA it would not be significant.
One workaround would be to filter out gene ontologies that are both up- and down-regulated, but is that fair? If you have an ontology that is 5-10 fold upregulated, and 0.5-1.0 fold downregulated, would you call that ontology upregulated or unchanged? GSEA would probably call that upregulated, but then what cutoffs are appropriate? What about 20-fold upregulated and 5-fold downregulated? Is that significant/meaningful?
I think you're asking an interesting question regarding ontologies within fold change intervals; I'd just be worried that the arbitrary nature of the cutoffs can lead to some false positives/negatives. If this is simply an in-house analysis for filtering or discovery work, then I don't see much harm. But if you're hoping to publish this analysis or downstream findings, it might be difficult to defend this method to reviewers. GSEA using gene ontologies has its own drawbacks, but it's well established and accepted in the field, and I can't imagine a reviewer giving it too much difficulty.
Arbitrary cut-offs can be misleading, so in such a case one should also test merged neighboring intervals, so that the full range of values (low to high expression) is covered, and then perform the enrichment analysis. In my opinion, binning by log2FC values certainly adds relevance to the whole analysis by bringing the expression values into the picture.
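One possible reading of "joined neighboring intervals" is to take the union of each run of adjacent bins and test those wider, overlapping windows as well. The sketch below follows that interpretation, which is an assumption; the function name `merge_neighbors`, the `width` parameter, and the gene IDs are hypothetical.

```python
def merge_neighbors(binned, width=2):
    """Union each run of `width` consecutive bins into one wider interval."""
    keys = sorted(binned)  # intervals ordered by their lower edge
    merged = {}
    for start in range(len(keys) - width + 1):
        window = keys[start:start + width]
        genes = [g for key in window for g in binned[key]]
        merged[(window[0][0], window[-1][1])] = genes
    return merged

# Hypothetical bins of up-regulated genes keyed by (lower, upper) log2FC edge
binned = {
    (0.0, 0.5): ["geneA"],
    (0.5, 1.0): ["geneB", "geneC"],
    (1.0, 1.5): ["geneD"],
}
merged = merge_neighbors(binned, width=2)
```

A term that is enriched in a single bin but not in any of the merged windows containing it would be a candidate for a cut-off artifact.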