Question

What is the best background gene list to choose for GSEA, ORA, and KOBAS?

1


6 weeks ago

Pegasus ▴ 120

Hi everyone,

I have a couple of questions regarding best practices in enrichment analysis for RNA-seq data, particularly in microbial systems, based on the following paper and post:

Post: Methodological problems are extremely common for enrichment analysis - beware the pitfalls before you publish

Paper: "Urgent need for consistent standards in functional enrichment analysis"

As the paper notes:

"In the case of ORA for differential expression (e.g., RNA-seq), a whole genome background is inappropriate because, in any tissue, most genes are not expressed and therefore have no chance of being classified as DEGs. A good rule of thumb is to use a background gene list consisting of genes detected in the assay at a level where they have a chance of being classified as DEG."

Given that microbial RNA-seq data doesn’t involve tissue-specific expression, I’m curious about the most appropriate approach for defining the background set in this context.

1. Recommended Background for Microbial RNA-seq: Which of the following would be the best choice for the background gene set?

a. All genes

b. Genes filtered with TPM > a certain threshold (e.g., 10)

c. Genes filtered with CPM > 1

d. DEGs with Log2FC above/below a certain threshold

2. For GSEA and ORA, should DEGs be separated by direction of regulation?

a. Use up- and downregulated DEGs together in one list

b. Separate them into two lists (upregulated and downregulated) and analyze them individually

I’ve come across different perspectives on these points, but I’m still unsure about the best approach. Any guidance would be greatly appreciated!

Thanks in advance for your help!

RNA-SEQ • 719 views

ADD COMMENT • link 5 weeks ago by Pegasus ▴ 120

score 1 · Answer 1 · 2024-11-03

1


6 weeks ago

LChart 4.7k

The best gene set to use as the background is whatever gene set you put in for differential expression. In practice you get a matrix with (depending on your species) 10K-60K+ features, many of which will be identically 0 or mostly 0. These features have essentially no power to be called differentially expressed, so you'll set a threshold on count, CPM, variability, or a similar measure. If you want to know where to draw the line, perform a power analysis and pick the cutoff that gives reasonable power (50%, 80%, etc.). This will give you a 2K-16K+ feature matrix that you'll use for DEG analysis, WGCNA, or whatever else.

This matrix is your background set.
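
As a rough illustration of that filtering step, here is a minimal sketch in plain Python. The function name `background_genes`, the thresholds (`min_cpm=1.0`, `min_samples=2`), and the toy counts are all assumptions for the example, not values recommended in this thread; in a real analysis the background would be whatever features survive the filter you actually applied before differential expression.

```python
def background_genes(counts, min_cpm=1.0, min_samples=2):
    """Return genes whose CPM exceeds min_cpm in at least min_samples samples.

    counts: dict mapping gene ID -> list of raw read counts, one per sample.
    Thresholds are illustrative assumptions, not recommendations.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads per sample across all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = []
    for gene, row in counts.items():
        n_pass = sum(
            1
            for j, c in enumerate(row)
            if lib_sizes[j] > 0 and c * 1e6 / lib_sizes[j] > min_cpm
        )
        if n_pass >= min_samples:
            kept.append(gene)
    return kept


# Toy count matrix (hypothetical data): geneB and geneD are mostly zero.
counts = {
    "geneA": [500, 450, 600],
    "geneB": [0, 0, 1],
    "geneC": [1200, 900, 1000],
    "geneD": [0, 0, 0],
}
background = background_genes(counts)
print(background)
```

The same list then serves both as the input to differential expression and as the background (universe) for ORA.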

As regards (2): you should not subset genes for GSEA, because this changes the background set. For a signed analysis you can rank by logFC; for an unsigned analysis you can rank by absolute logFC or log p-values. Beyond that, the question is whether you want a signed analysis (alternate hypothesis: gene set X is [over-expressed] [under-expressed]) or an unsigned analysis (alternate hypothesis: gene set X is [not similarly expressed]). Signed analyses are more interpretable, in general.
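
A small sketch of how such ranking metrics might be computed, assuming you already have a logFC and p-value for every gene in the background. The helper `rank_metric`, the p-value floor of `1e-300`, and the toy results are assumptions for illustration; the signed variant multiplies -log10(p) by the sign of the logFC, which is one common way to build a signed p-value rank.

```python
import math


def rank_metric(log_fc, p_value, signed=True):
    """Score a gene for a GSEA-style ranked list.

    signed=True  -> sign(logFC) * -log10(p): direction is preserved.
    signed=False -> -log10(p): only the strength of evidence is kept.
    The 1e-300 floor (an assumption) avoids log10(0).
    """
    score = -math.log10(max(p_value, 1e-300))
    if signed:
        return math.copysign(score, log_fc)
    return score


# Hypothetical DE results for the full background: gene -> (logFC, p-value).
results = {
    "geneA": (2.0, 0.001),
    "geneB": (-1.5, 0.01),
    "geneC": (0.1, 0.5),
}

# Rank ALL background genes; nothing is dropped before ranking.
ranked = sorted(results, key=lambda g: rank_metric(*results[g]), reverse=True)
print(ranked)
```

In the signed ranking, strongly up-regulated genes sit at the top and strongly down-regulated genes at the bottom, so enrichment at either end can be read as a direction.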

ADD COMMENT • link 6 weeks ago by LChart 4.7k

0


Thank you LChart for your clear reply.

Regarding the use of the goseq package (with gene length correction), several discussions have mentioned separating upregulated and downregulated genes, which allows us to examine the direction of regulation. This is why I'm still uncertain about how to approach the alternative methods, such as GSEA (Gene Set Enrichment Analysis) or ORA (Overrepresentation Analysis). When we analyze upregulated and downregulated genes together in a single analysis, we may observe an enriched GO term. In that case, how can we accurately interpret the direction of change in transcript abundance in our discussion?

Thank you in advance for any input!

ADD REPLY • link 5 weeks ago by Pegasus ▴ 120

1


What you are describing is a "signed analysis." For GSORA this is achieved by choosing significant, up-regulated (or, separately, significant, down-regulated) genes as the "positive set" over the background. For GSEA this is achieved by ranking on logFC or signed p-values rather than unsigned p-values.

ADD REPLY • link 5 weeks ago by LChart 4.7k

0


Just to clarify the point discussed above: since there is still significant debate about whether to separate them in goseq and in GSEA/ORA, is it common practice to separate upregulated and downregulated genes?

Thanks

ADD REPLY • link 5 weeks ago by Pegasus ▴ 120

2


GSEA: Do not separate up-regulated and down-regulated genes. If you do this, you're asking the question "Among genes that are down-regulated, is the down-regulated component of pathway X more down-regulated than the overall down-regulated component?" Do you understand why this question has little biological meaning?

GSORA: DO separate up-regulated and down-regulated genes. When you do this, you're asking "Among all genes, does pathway X enrich for up-regulated genes?"
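
To make the ORA side concrete, here is a minimal sketch of a one-sided hypergeometric test run separately on up- and down-regulated genes against the same background. The helper names `hypergeom_sf` and `ora`, the gene IDs, and the pathway are hypothetical; real tools (goseq, for example) add corrections such as gene length bias that this sketch does not attempt.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric: N background genes,
    K of them in the pathway, n genes drawn (the DEG list)."""
    upper = min(K, n)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)


def ora(hits, pathway, background):
    """One-sided over-representation p-value, restricted to the background."""
    bg = set(background)
    pathway = set(pathway) & bg
    hits = set(hits) & bg
    k = len(hits & pathway)
    return hypergeom_sf(k, len(bg), len(pathway), len(hits))


# Hypothetical example: 20 background genes, 5 of which are in pathway X.
background = [f"g{i}" for i in range(20)]
pathway_x = ["g0", "g1", "g2", "g3", "g4"]
up_genes = ["g0", "g1", "g2", "g10"]   # significant, up-regulated
down_genes = ["g15", "g16", "g17"]     # significant, down-regulated

p_up = ora(up_genes, pathway_x, background)
p_down = ora(down_genes, pathway_x, background)
print(f"up: {p_up:.4f}  down: {p_down:.4f}")
```

Running the two lists separately is what lets an enriched term be described as enriched among up-regulated or among down-regulated genes.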

ADD REPLY • link 5 weeks ago by LChart 4.7k

0


Thank you so much, LChart. In that case, I believe both goseq and KEGG pathway analysis follow the ORA approach, so we have to separate up- and downregulated genes.

Cheers,

ADD REPLY • link 5 weeks ago by Pegasus ▴ 120