Hi everyone,
I have a couple of questions regarding best practices in enrichment analysis for RNA-seq data, particularly in microbial systems, based on the following paper and post:
Post: Methodological problems are extremely common for enrichment analysis - beware the pitfalls before you publish
Paper: "Urgent need for consistent standards in functional enrichment analysis"
As the paper notes:
"In the case of ORA for differential expression (e.g., RNA-seq), a whole genome background is inappropriate because, in any tissue, most genes are not expressed and therefore have no chance of being classified as DEGs. A good rule of thumb is to use a background gene list consisting of genes detected in the assay at a level where they have a chance of being classified as DEG."
Given that microbial RNA-seq data doesn’t involve tissue-specific expression, I’m curious about the most appropriate approach for defining the background set in this context.
1. Recommended Background for Microbial RNA-seq: Which of the following would be the best choice for the background gene set?
a. All genes
b. Genes filtered with TPM > a certain threshold (e.g., 10)
c. Genes filtered with CPM > 1
d. DEGs with Log2FC above/below a certain threshold
2. For GSEA and ORA, should DEGs be separated by direction of regulation?
a. Use up- and downregulated DEGs together in one list
b. Separate them into two lists (upregulated and downregulated) and analyze them individually
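For reference, the CPM-based background in option (c) of question 1 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `cpm` and `background_genes`, the "at least 2 samples" rule, and the toy counts are all hypothetical and not taken from the paper or the post.

```python
def cpm(counts):
    """Convert raw counts for one sample (gene -> count) to counts per million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def background_genes(samples, min_cpm=1.0, min_samples=2):
    """Return genes whose CPM exceeds min_cpm in at least min_samples samples.

    The thresholds are assumptions chosen for illustration only.
    """
    cpms = [cpm(sample) for sample in samples]
    genes = set().union(*(sample.keys() for sample in samples))
    return {
        gene
        for gene in genes
        if sum(c.get(gene, 0.0) > min_cpm for c in cpms) >= min_samples
    }


# Hypothetical toy data: three samples, raw counts per gene.
samples = [
    {"geneA": 500_000, "geneB": 499_999, "geneC": 1},
    {"geneA": 600_000, "geneB": 400_000, "geneC": 0},
    {"geneA": 450_000, "geneB": 549_998, "geneC": 2},
]

# geneC exceeds 1 CPM in only one sample, so it is left out of the background.
background = background_genes(samples)
```

The resulting set would then serve as the universe for ORA in place of the whole genome.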
I’ve come across different perspectives on these points, but I’m still unsure about the best approach. Any guidance would be greatly appreciated!
Thanks in advance for your help!
Thank you LChart for your clear reply.
Regarding the goseq package (with gene length correction), several discussions recommend separating upregulated and downregulated genes, which makes it possible to examine the direction of regulation. This is why I’m still uncertain about how to approach the alternative methods, such as GSEA (Gene Set Enrichment Analysis) or ORA (Overrepresentation Analysis). When we analyze upregulated and downregulated genes together in a single list, we may observe an enriched GO term. In that case, how can we tell whether the term reflects increased or decreased transcript abundance when we discuss the result?
Thank you in advance for any input!
What you are describing is a "signed analysis." For GSORA (gene-set over-representation analysis, i.e. ORA), this is achieved by choosing the significant up-regulated genes (or, separately, the significant down-regulated genes) as the "positive set" against the background. For GSEA, this is achieved by ranking genes on logFC or a signed p-value rather than on unsigned p-values.
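As a rough illustration of a signed ranking for GSEA, one commonly used metric is sign(log2FC) × −log10(p). The sketch below assumes a hypothetical `results` mapping of gene to (log2FC, p-value); the function name `signed_score` and the p-value floor are illustrative choices, not part of any particular package.

```python
import math


def signed_score(log2fc, pvalue, floor=1e-300):
    """Signed significance: sign(log2FC) * -log10(p).

    Large positive values mean strongly up-regulated, large negative values
    mean strongly down-regulated. The floor avoids log10(0) for p-values
    that underflow to zero.
    """
    magnitude = -math.log10(max(pvalue, floor))
    return math.copysign(magnitude, log2fc)


# Hypothetical differential expression results: gene -> (log2FC, p-value).
results = {
    "geneA": (2.0, 1e-8),
    "geneB": (-1.5, 1e-6),
    "geneC": (0.1, 0.5),
    "geneD": (-3.0, 1e-10),
}

# One ranked list containing ALL tested genes, from most up to most down.
ranked = sorted(results, key=lambda g: signed_score(*results[g]), reverse=True)
```

The whole ranked list goes into GSEA at once; the sign of the resulting enrichment score then indicates whether a gene set sits toward the up-regulated or the down-regulated end.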
Just to clarify the point discussed above: since there is still significant debate about whether to separate the lists in goseq and in GSEA/ORA, is it common practice to separate upregulated and downregulated genes?
Thanks
GSEA: Do not separate up-regulated and down-regulated genes. If you do, you are asking: "Among genes that are down-regulated, is the down-regulated component of pathway X more down-regulated than the overall down-regulated component?" Do you see why this question has little biological meaning?
GSORA: DO separate up-regulated and down-regulated genes. When you do this, you're asking "Among all genes, does pathway X enrich for up-regulated genes?"
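A minimal sketch of that signed ORA, assuming a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test), which is one common choice for ORA. The function name `ora_pvalue` and the toy gene sets are hypothetical; a real analysis would also correct for multiple testing across pathways.

```python
from math import comb


def ora_pvalue(positive, pathway, background):
    """One-sided hypergeometric p-value P(X >= k) for over-representation.

    positive:   significant genes in ONE direction (up or down)
    pathway:    genes annotated to the pathway
    background: all genes that had a chance of being called DE
    """
    background = set(background)
    positive = set(positive) & background   # restrict both sets to the background
    pathway = set(pathway) & background
    N, K, n = len(background), len(pathway), len(positive)
    k = len(positive & pathway)             # observed overlap
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Hypothetical toy data: 20 background genes, a 5-gene pathway.
background = [f"g{i}" for i in range(20)]
pathway_x = ["g0", "g1", "g2", "g3", "g4"]
up_genes = ["g0", "g1", "g2", "g3", "g10"]
down_genes = ["g15", "g16", "g17", "g18", "g19"]

# Run the test separately for each direction against the same background.
p_up = ora_pvalue(up_genes, pathway_x, background)
p_down = ora_pvalue(down_genes, pathway_x, background)
```

Running the two directions separately is what lets you say whether pathway X is enriched among up-regulated or among down-regulated genes.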
Thank you so much, LChart. In that case, I believe both goseq and KEGG pathway analysis follow the ORA approach, so we have to separate the up- and downregulated genes.
Cheers,