Question

No over-represented GO terms, but plenty from GSEA?

0

Entering edit mode

6 weeks ago

bioinfo2345 ▴ 40

Hi,

I am doing a functional analysis of a differential expression analysis on a non-model organism. I have one experimental group and one control group with four replicates each.

The number of differentially expressed genes I get is around 2000 (FC > |1| and adjusted p-value < 0.05) and they have about 3500 GO terms associated with them.

When I do an enricher GO over-representation analysis, I use the following code:

enricher_10 <- enricher(diff_exp_genes_names, 
                    pvalueCutoff = 0.05,
                    pAdjustMethod = "BH",
                    universe = NULL,
                    minGSSize = 10,
                    qvalueCutoff = 0.2,
                    gson = NULL,
                    TERM2GENE=go_data,
                    TERM2NAME=NA)

The diff_exp_genes_names is a vector of custom gene IDs:

 chr [1:2021] "Pr_t00014577-TA" "Pr_t00019548-TA" "Pr_t00018659-TA" ..

and my TERM2GENE file has this general format:

                GOs      query_name
        <char>          <char>
 1: GO:0005275  Pr_t00014577-TA
 2: GO:0005222  Pr_t00018659-TA
 3: GO:0005323  Pr_t00018659-TA

Here I get no statistically significant GO terms after adjustment for multiple testing, but many are statistically significant before correction, so the actual analysis is carried out.

When I do GSEA with this code:

GSEA_results_fgsea_minGSSize_10 <- GSEA(gene_list_final, 
                                    exponent = 1, 
                                    minGSSize = 10,
                                    maxGSSize = 500, 
                                    eps = 1e-10,
                                    pvalueCutoff = 0.05, 
                                    pAdjustMethod = "BH",
                                    gson = NULL, 
                                    TERM2GENE=go_data, 
                                    TERM2NAME = NA, 
                                    verbose = TRUE,
                                    seed = FALSE, 
                                    by = "fgsea")

I get plenty of statistically significant results.

The gene_list_final looks like this:

Named num [1:12510] 7.35 6.99 6.7 6.66 6.6 ...
- attr(*, "names")= chr [1:12510] "Pr_t00018699-TA" "Pr_t00007428-TA" "Pr_t00004980-TA" "Pr_t00004978-TA" ...

Question 1: Have I done the GO over-representation analysis incorrect?

Question 2: Does the GO over-representation analysis not produce any statistically significant results because the number of GO terms are too many?

Question 3: Should I use a more stringent cut-off for FC and adjusted p value to get a list of differentially expressed genes that are easier to analyze and handle? How do you analyze such large lists of genes (circa 2000 genes)?

enricher clusterprofiler • 284 views

ADD COMMENT • link updated 6 weeks ago by dthorbur ★ 2.5k • written 6 weeks ago by bioinfo2345 ▴ 40

score 0 · Answer 1 · 2024-10-23

Question 1: I've not used enricher, so I can't comment on that aspect.

Question 2: You have quite a large number of genes in your input list. 2k transcripts will make up a sizable portion of the transcriptome in almost all species. I suspect any results are being lost due to p-value correction of a large list of GO terms.

Question 3: Your suggestions sound reasonable. Increasing FC is a good start. But also to test your pipeline is actually working, you could randomly select 100 transcripts instead, and you're likely to get something significant. Or take 20 transcripts associated with a specific trait or pathway and see if you can detect that signal. I would struggle to analyse a list of 2000 genes. If I ended up in that situation, I would rethink my question or thresholds.