No over-represented GO terms, but plenty from GSEA?
1
0
Entering edit mode
6 weeks ago
bioinfo2345 ▴ 40

Hi,

I am doing a functional analysis of a differential expression analysis on a non-model organism. I have one experimental group and one control group with four replicates each.

The number of differentially expressed genes I get is around 2000 (FC > |1| and adjusted p-value < 0.05) and they have about 3500 GO terms associated with them.

When I do an enricher GO over-representation analysis, I use the following code:

enricher_10 <- enricher(diff_exp_genes_names, 
                    pvalueCutoff = 0.05,
                    pAdjustMethod = "BH",
                    universe = NULL,
                    minGSSize = 10,
                    qvalueCutoff = 0.2,
                    gson = NULL,
                    TERM2GENE=go_data,
                    TERM2NAME=NA)

The diff_exp_genes_names is a vector of custom gene IDs:

 chr [1:2021] "Pr_t00014577-TA" "Pr_t00019548-TA" "Pr_t00018659-TA" ..

and my TERM2GENE file has this general format:

                GOs      query_name
        <char>          <char>
 1: GO:0005275  Pr_t00014577-TA
 2: GO:0005222  Pr_t00018659-TA
 3: GO:0005323  Pr_t00018659-TA

Here I get no statistically significant GO terms after adjustment for multiple testing, but many are statistically significant before correction, so the actual analysis is carried out.

When I do GSEA with this code:

GSEA_results_fgsea_minGSSize_10 <- GSEA(gene_list_final, 
                                    exponent = 1, 
                                    minGSSize = 10,
                                    maxGSSize = 500, 
                                    eps = 1e-10,
                                    pvalueCutoff = 0.05, 
                                    pAdjustMethod = "BH",
                                    gson = NULL, 
                                    TERM2GENE=go_data, 
                                    TERM2NAME = NA, 
                                    verbose = TRUE,
                                    seed = FALSE, 
                                    by = "fgsea")

I get plenty of statistically significant results.

The gene_list_final looks like this:

Named num [1:12510] 7.35 6.99 6.7 6.66 6.6 ...
- attr(*, "names")= chr [1:12510] "Pr_t00018699-TA" "Pr_t00007428-TA" "Pr_t00004980-TA" "Pr_t00004978-TA" ...

Question 1: Have I done the GO over-representation analysis incorrect?

Question 2: Does the GO over-representation analysis not produce any statistically significant results because the number of GO terms are too many?

Question 3: Should I use a more stringent cut-off for FC and adjusted p value to get a list of differentially expressed genes that are easier to analyze and handle? How do you analyze such large lists of genes (circa 2000 genes)?

enricher clusterprofiler • 284 views
ADD COMMENT
0
Entering edit mode
6 weeks ago
dthorbur ★ 2.5k

Question 1: I've not used enricher, so I can't comment on that aspect.

Question 2: You have quite a large number of genes in your input list. 2k transcripts will make up a sizable portion of the transcriptome in almost all species. I suspect any results are being lost due to p-value correction of a large list of GO terms.

Question 3: Your suggestions sound reasonable. Increasing FC is a good start. But also to test your pipeline is actually working, you could randomly select 100 transcripts instead, and you're likely to get something significant. Or take 20 transcripts associated with a specific trait or pathway and see if you can detect that signal. I would struggle to analyse a list of 2000 genes. If I ended up in that situation, I would rethink my question or thresholds.

ADD COMMENT

Login before adding your answer.

Traffic: 1987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6