Hi all,
I need your help since I am very much in doubt about the right approach for gene enrichment analysis.
For my non-model organism, I have a background list of official gene IDs (from a Trinotate-annotated de novo transcriptome, which also assigned GO terms to each gene) and a list of differentially expressed genes (also official gene IDs) on which I want to run a functional enrichment analysis.
My first approach was using the R-package clusterProfiler since I have the GO-terms from the Trinotate annotation. But the statistics seemed off (FDR, Bonferroni, Benjamini etc. all came out with the same values).
My current approach is DAVID - but I am very much in doubt about what is the "right thing" to do!
The background gene list and list of differentially expressed genes are converted from official gene IDs into Entrez IDs. But each gene is assigned multiple Entrez IDs.
I have tried running the full converted list with multiple Entrez IDs per gene, and I have also removed "duplicates" and run the analysis with only one representative Entrez ID per gene. The results are very different, both in the number of enriched terms and in which GO terms come up.
I would very much like to hear what your approach to DAVID is: would you remove "duplicates" (i.e. only use one representative Entrez ID per gene symbol) or use the full converted list?
Best wishes, Birgitte
It seems to me that at least part of the problem is that you're dealing with disparate annotation resources. If an annotated reference genome is available (for example in Ensembl), I would suggest using it to define the gene set and mapping all your data to this gene set, not by ID conversion but by actually mapping sequences. Alternatively, you can use your Trinotate annotations as the reference, map all your other sequences to it, and call the differentially expressed genes from there. You then collect GO annotations for the reference gene set and do the enrichment analysis. Don't rely on tools that come with preloaded GO annotations for your organism, because they may be using a different reference gene set. The problem with ID conversion is that you don't always know how it's done (different resource versions, indirect mapping via unidentified third parties...). In your case, redundant entries will bias the statistics, but if you select only one representative entry, you could be picking an entry whose annotations differ from the others, so the results would depend on which set you select. Using a single annotated reference makes the whole analysis consistent and reproducible.
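To make the "collect GO annotations for the reference gene set and do enrichment analysis" step concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions that are not from this thread: the gene IDs and GO labels are made-up placeholders, the test is a one-sided hypergeometric test, and the multiple-testing correction is Benjamini-Hochberg. The function names (`hypergeom_sf`, `benjamini_hochberg`, `enrich`) are hypothetical helpers, not part of clusterProfiler or DAVID.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrich(gene2go, de_genes):
    """Test every GO term; each gene counts once because it is a dict key."""
    background = set(gene2go)
    de = set(de_genes) & background  # DE genes must be in the background
    N, n = len(background), len(de)
    terms = sorted({t for gos in gene2go.values() for t in gos})
    rows = []
    for term in terms:
        annotated = {g for g, gos in gene2go.items() if term in gos}
        K, k = len(annotated), len(annotated & de)
        rows.append((term, k, K, hypergeom_sf(k, N, K, n)))
    qvals = benjamini_hochberg([row[3] for row in rows])
    return [row + (q,) for row, q in zip(rows, qvals)]


# Toy data: placeholder gene IDs and GO labels, for illustration only.
toy_gene2go = {
    "g1": {"GO:x1"},
    "g2": {"GO:x1"},
    "g3": {"GO:x1", "GO:x2"},
    "g4": {"GO:x2"},
    "g5": {"GO:x2"},
    "g6": set(),
}
toy_de = ["g1", "g2", "g3"]

for term, k, K, p, q in enrich(toy_gene2go, toy_de):
    print(term, k, K, round(p, 4), round(q, 4))
```

Because the background and the DE list are both expressed in the same gene IDs and each gene appears exactly once, no ID conversion is involved. If a gene were entered several times under different converted IDs, it would be counted several times in both the term counts and the list sizes, which is one way redundant entries can shift the p-values.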
Hi Jean-Karim, Thank you for your response! There is no reference genome available for the species I am working with, which is why I am using a Trinotate-annotated gene list as background. But yes, the redundancy in the gene ID conversion makes it troublesome. Best, Birgitte