I have been looking all over the web for an answer to my problem, but unfortunately without success. I wish to determine whether an a priori defined set of genes (in my case, genes associated with Epidermolysis bullosa) shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
I have a list of genes from my own experiment and a predefined list of genes associated with Epidermolysis bullosa, and I want to find out whether the Epidermolysis bullosa genes are overrepresented in my list.
Here is an article about GSEA, which is related to what I want to do: https://www.pnas.org/doi/epdf/10.1073/pnas.0506580102
So I have the following data: I'm researching patients with severe or mild disease. In this research, I found a list of 200 genes with a p-value < 0.05 after running a SKAT test in R (https://www.rdocumentation.org/packages/SKAT/versions/2.2.4/topics/SKAT); I presume these genes may cause severe rather than mild disease. I want to focus on these 200 significant genes, which may affect disease severity; they would be the L list described in the article above. The S list of predefined genes (as it is called in the article above) is the set of genes associated with Epidermolysis bullosa, and I want to know whether the Epidermolysis bullosa genes are overrepresented in my L list. After reading the article and many other resources, I understood that to perform GSEA (https://www.gsea-msigdb.org/gsea/index.jsp) I need gene expression data, which I don't have.
So I would like to know whether there is a tool, in R or Python, that can do a gene set comparison without expression data, using only my gene list and a predefined list (in my case, the genes associated with Epidermolysis bullosa). Thank you :)
Papyrus, thank you. And if I have more than one predefined set of genes that I want to test besides Epidermolysis bullosa, to check whether the genes from my experiment are enriched for them, would the tools you listed work?
Yes. I mean, you could always write a loop and run multiple Fisher's/hypergeometric tests, but these tools are specialized for that task and provide a lot of functionality. Be sure to check the vignettes. For example, the `enricher` function in `clusterProfiler` lets you do this if you construct a custom `TERM2GENE` annotation:
Papyrus, thank you very much for your replies. Just one clarification about the function you provided:
Here, is `gene=letters` the genes with p-values < 0.05, and is `universe` the N, as here?
N <- 10000 # This should be the number of genes in your "background/universe" (probably all of the genes for which you perform the SKAT test; i.e., all the genes which had a chance of appearing as significant)
n <- 200 # This should be the number of your 200 genes with SKAT p < 0.05
And, as I understand it, it also performs a hypergeometric test?
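For readers following along in Python, the "loop over several hypergeometric tests" idea mentioned above can be sketched with the standard library alone. This is only an illustrative sketch, not what `enricher` does internally: the function names, the toy gene IDs (`G1` ... `G20`) and the two toy gene sets are made up for the example, and the Bonferroni correction is used only because it is the simplest adjustment to write down.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k): chance of seeing at least k set members among n genes
    drawn without replacement from N genes, K of which belong to the set."""
    lo, hi = max(k, 0), min(K, n)
    # Exact integer arithmetic; only the final division is floating point.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return favourable / comb(N, n)


def enrich_gene_sets(significant, universe, gene_sets):
    """Run one over-representation test per gene set (hypothetical helper)."""
    universe = set(universe)
    significant = set(significant) & universe   # n: genes of interest
    N, n = len(universe), len(significant)
    results = []
    for name, members in gene_sets.items():
        members = set(members) & universe       # K: set members in the background
        K, k = len(members), len(members & significant)
        p = hypergeom_upper_tail(k, N, K, n)
        results.append({"set": name, "k": k, "K": K, "p": p,
                        # Bonferroni: multiply by the number of sets tested.
                        "p_bonferroni": min(1.0, p * len(gene_sets))})
    return sorted(results, key=lambda r: r["p"])


# Toy data (placeholders, not real genes or real results).
universe = [f"G{i}" for i in range(1, 21)]            # N = 20
significant = ["G1", "G2", "G3", "G4", "G5"]          # n = 5
gene_sets = {
    "set_a": ["G1", "G2", "G3", "G4", "G6"],
    "set_b": ["G10", "G11", "G12", "G13", "G14", "G15"],
}

results = enrich_gene_sets(significant, universe, gene_sets)
for row in results:
    print(row)
```

With real data, `universe` would be every gene that went into the SKAT test and `significant` the 200 genes with p < 0.05, as described in the comment above.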
Yes! `gene` are your significant genes (your genes of interest, `n`) and `universe` are all the genes in your analysis (all the genes which had a chance of appearing as significant, `N`). And yes, it will perform hypergeometric tests for all the pathways that are described in `TERM2GENE`.
Papyrus, thank you very much. I ran the analysis and was wondering about the following results that I got:
Eczema got a significant p-value, but when I run it, for example, without psoriasis, the p-value is no longer significant (> 0.05). I can't seem to understand why.
What do you mean by running it "without psoriasis"? I don't understand. Also, if the "BgRatio" column means what I think it means, almost all of the 5426 genes you profiled are classified as belonging to "eczema", so the enrichment is probably not very strong.
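To make the point about BgRatio concrete, here is a small arithmetic sketch in Python. It takes the comment's reading of BgRatio as K/N at face value, and it assumes that all n = 200 genes of interest are counted in the test, which is not confirmed anywhere in the thread.

```python
from math import comb

K, N = 5320, 5426   # BgRatio as quoted from the enricher output, read as K/N
n = 200             # ASSUMPTION: all 200 SKAT-significant genes enter the test

background_fraction = K / N                   # share of the background in the set
expected_hits = n * background_fraction       # hits expected by chance alone
max_fold_enrichment = 1 / background_fraction # reached only if all n genes are hits

# Smallest p-value the test can give: every gene of interest is in the set.
p_all_hits = comb(K, n) / comb(N, n)

print(f"background fraction : {background_fraction:.4f}")
print(f"expected hits       : {expected_hits:.1f} of {n}")
print(f"max fold enrichment : {max_fold_enrichment:.3f}")
print(f"best possible p     : {p_all_hits:.4f}")
```

When a set covers nearly the whole background, even a perfect overlap is only a few genes above what chance would give, so a p-value below 0.05 for such a set corresponds to a very small effect.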
Papyrus, I tested whether the genes from my experiment are enriched for genes from various skin diseases. As I understand it, the eczema p-value is < 0.05; does that mean my genes are enriched for genes associated with eczema? I also ran the regular hypergeometric test and got a significant result:
It would seem that way, if the provided numbers are correct. Are all the 5157 genes associated with eczema present in your background of 8873 genes?
Papyrus, yes, this is the intersection `intersect(eczema_id$SYMBOL,genes$Gene.refGene_ANNOVAR)`, which is the K in the test.
Hmm, however, I'm not quite sure how, if you have 5157 genes (K) intersecting your background (N), the `enricher` results you show state that BgRatio (which I think is K/N) is 5320/5426. Check that you're inputting the same data into `enricher` that you use in the separate Fisher test, and that you're doing everything correctly (no duplicated genes, etc.). The numbers shown by `enricher` could be lower than in your data because it may filter everything for those genes present in the pathways you submit, but I don't think they should be higher.
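The checks suggested above (duplicates, genes missing from the background, and the same k, K, n, N going into both analyses) can be sketched in Python as follows. The helper names and the toy IDs are hypothetical, and the optional `annotated` argument only mimics the filtering that the comment speculates `enricher` may apply; it is not taken from the package itself.

```python
from collections import Counter


def duplicated(ids):
    """Return the IDs that occur more than once in a list."""
    return sorted(g for g, c in Counter(ids).items() if c > 1)


def overlap_counts(significant, universe, gene_set, annotated=None):
    """Return (k, K, n, N) after de-duplication.

    If `annotated` is given, the background is first restricted to genes that
    appear in at least one submitted gene set (ASSUMED filtering behaviour).
    """
    background = set(universe)
    if annotated is not None:
        background &= set(annotated)
    hits = set(significant) & background      # genes of interest in the background
    members = set(gene_set) & background      # gene-set members in the background
    return len(hits & members), len(members), len(hits), len(background)


# Toy data (placeholders): "B" is duplicated, "Z" and "Y" are not in the background.
universe = ["A", "B", "C", "D", "E", "F", "G", "H"]
significant = ["A", "B", "B", "Z"]
set_one = ["A", "B", "C", "Y"]
set_two = ["D", "E"]

dupes = duplicated(significant)
outside = sorted(set(significant) - set(universe))
full = overlap_counts(significant, universe, set_one)
restricted = overlap_counts(significant, universe, set_one,
                            annotated=set(set_one) | set(set_two))

print("duplicated genes of interest:", dupes)
print("genes of interest outside the background:", outside)
print("k, K, n, N with the full background      :", full)
print("k, K, n, N with the restricted background:", restricted)
```

Comparing the two tuples shows how filtering the background can shrink N (and possibly n and K) but cannot make any of the counts larger, which is why a BgRatio numerator above the K computed by hand points to a difference in the inputs.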