I ran a KEGG enrichment analysis for co-expression clusters on a non-model organism using enrichKEGG. The list of top 5 enriched pathways looks good and seems biologically plausible in the context of our experiment, but the numbers in GeneRatio and BgRatio are confusing.
universe_set is a vector of all unique KOs annotated with KofamKOALA for my de novo transcriptome. gene_set is a vector of all unique KOs in my gene set (transcripts from one of the co-expression clusters).
Test code:
# TEST CODE
gene_set <- c("K23540", "K23487", "K22188")
length(gene_set) # 3
universe_set <- c("K23540", "K23487", "K22188", "K06843", "K04678", "K17601", "K22037", "K23897", "K04675", "K04437", "K09191", "K00476", "K23437", "K12487", "K09292", "K21469", "K06639", "K20523", "K07366", "K14688", "K04811", "K24738", "K12026", "K25615", "K10049", "K24496", "K06238", "K22647", "K19716", "K17613", "K13100", "K12261", "K22156")
length(universe_set) # 33
library(clusterProfiler)  # provides enrichKEGG()
kegg_kofam <- enrichKEGG(gene = gene_set,
                         universe = universe_set,
                         organism = "ko",
                         keyType = "kegg",
                         pAdjustMethod = "BH",
                         minGSSize = 1,
                         maxGSSize = 1000,
                         pvalueCutoff = 1,
                         qvalueCutoff = 1,
                         use_internal_data = FALSE)
kegg_kofam_DF <- kegg_kofam[,1:9]
My gene_set vector has length 3 and my universe_set vector has length 33. Every element of gene_set is also in universe_set.
I expected GeneRatio to be k/3 and BgRatio to be M/33, but I get GeneRatio k/1 and BgRatio M/17.
Where could the error or snag be?
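The effective universe is the intersection of universe and the genes annotated in the pathway database. That is apparently just 17 out of 33 genes. The same filtering applies to the gene list: KOs that are not assigned to any pathway are dropped before the test, so a GeneRatio denominator of 1 suggests that only 1 of your 3 KOs has a pathway annotation.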
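To make the filtering concrete, here is a minimal base-R sketch of how the denominators come about. It does not call clusterProfiler and does not use real KEGG data: the pathway table `term2gene`, the pathway names and the gene IDs `g1` to `g7` are all hypothetical, chosen only to illustrate the intersect step and the hypergeometric test that follows it.

```r
# Hypothetical toy annotation (NOT real KEGG data): pathway -> gene IDs.
term2gene <- data.frame(
  term = c("pathA", "pathA", "pathB", "pathB", "pathB"),
  gene = c("g1", "g2", "g2", "g3", "g4"),
  stringsAsFactors = FALSE
)

universe <- c("g1", "g2", "g3", "g4", "g5", "g6", "g7")  # g5-g7 are in no pathway
gene_set <- c("g1", "g5", "g6")                          # only g1 is annotated

annotated <- unique(term2gene$gene)

# IDs without any pathway annotation are dropped from both sets.
effective_universe <- intersect(universe, annotated)
effective_genes    <- intersect(gene_set, effective_universe)

N <- length(effective_universe)  # denominator of BgRatio
n <- length(effective_genes)     # denominator of GeneRatio

for (p in unique(term2gene$term)) {
  members <- intersect(term2gene$gene[term2gene$term == p], effective_universe)
  M <- length(members)                              # numerator of BgRatio
  k <- length(intersect(effective_genes, members))  # numerator of GeneRatio
  # One-sided hypergeometric test: P(X >= k)
  pval <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
  cat(sprintf("%s  GeneRatio=%d/%d  BgRatio=%d/%d  p=%.3f\n", p, k, n, M, N, pval))
}
```

In this toy case the universe shrinks from 7 to 4 IDs and the gene set from 3 to 1, which mirrors the M/17 and k/1 denominators in the question. Running the same two intersect() calls on your real KO vectors against the KO-to-pathway mapping should show which KOs were dropped.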