I have a metagenomics dataset with de-novo open reading frames [ORF] calls analogous to genes. In this dataset, I have metagenomic assembled genomes [MAGs] and for each MAG I have clusters of ORFs. I have ran KOFAMSCAN to get KEGG orthologs for the ORFs.
I want ask the following statistical question:
For each MAG-specific cluster, which KEGG pathways are enriched?
I've already extracted all of info from this KEGG htext file so I have a gmt formatted table that basically has the following structure: [KEGG_ID]\t[KEGG_PATHWAY_DESCRIPTION]\t[KO_1]\t[KO_2]\t...[KO_N]
(though, I also have this in a python dictionary {"kegg_pathway":{KO_1, KO_2, ..., KO_N}
).
Additionally, I have a vector of positive floats that I could also use for each of the KO specific to each cluster.
My first idea was to try GSEA via GSEApy using prerank
module. I got this to run but realized it wants positive and negative weights so I don't think I'm using this appropriately. It seems to be specific to phenotypic data and I only have one condition.
My next idea was to try to use a hypergeometric test, though I'm not sure how this can be done exactly: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html
For example if I went down this path, would I do this for multiple iterations and what would I use for the M
, n
, and N
Can someone direct me to a software package (preferably Python or command line but R is fine if the implementation is straightforward), how to use the hypergeometric test properly, or a better way to go about solving this problem?
A recap of my dataset:
KEGG_DATABASE
- Mapping between KEGG pathways to a list of KEGG orthologs in said pathway (GMT file and Python dictionary)
METAGENOME
- MAGs
- Cluster of ORFs in MAG
- KEGG ortholog mapping for ORFs in cluster
- Vector of positive floats with "weight" for KEGG ortholog w.r.t. each MAG or Cluster (the details aren't important)
- MAGs
Thanks for the response. Expanding this out a bit. We have the following:
MAG_1
has 1000 genesgene_set_A
has 50 genesquery_cluster
has 20 genesquery_cluster
are ingene_set_A
In the example here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html
Would the way to generalize this to our problem be the following:
The survival function
sf
might be more straight forward in this case, since it will calculate the probability of k or more successes, which is identical to the p-value of the upper-tail.