I have a question: what exactly is the best combined score (adj. p-value and z-score) for enrichment analysis from Enrichr?
I've found Enrichr to be useful. The result tables are ranked by the combined score, and in a fair number of experiments I have seen relevant categories among the top ~10 gene sets for at least one reference set (ChEA 2016, GO, KEGG, etc.).
I also see that the Enrichr website lists two publications: Chen et al. 2013 and Kuleshov et al. 2016.
As for how the combined score is calculated, Chen et al. 2013 describe it as c = log(p) * z, where c is the combined score, p is the Fisher exact test p-value, and z is the z-score for deviation from expected rank. So, I think that is how the combined score is being calculated. In a downloaded table for the example differentially expressed gene list, I can replicate that calculation (if using the natural log to transform the unadjusted p-value).
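For anyone who wants to check this against their own downloaded table, here is a minimal sketch of that formula in Python. The function name and the example numbers are my own and are made up for illustration; they are not taken from an Enrichr table, and this is not Enrichr's code.

```python
import math


def combined_score(p_value: float, z_score: float) -> float:
    """Combined score c = ln(p) * z, using the unadjusted p-value.

    The natural log is used because that is what reproduced the
    downloaded table values in the check described above.
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return math.log(p_value) * z_score


# Made-up illustrative values (hypothetical, not from a real result table).
example_score = combined_score(p_value=0.001, z_score=-1.5)
print(round(example_score, 3))
```

Note that ln(p) is negative for p < 1, so a negative z-score gives a positive combined score.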
As for the z-score calculation, the Biostars discussion In Enrichr: What is "Gene weight" or "levels of membership"? pointed me to the Enrichr Help Center, which describes calculating the background for the z-score from a pre-defined table with random gene lists.
Do you know by any chance how cases in which the z-score is NA are treated before calculating the combined score? Are they converted to 1? I checked the Enrichr papers but couldn't find a clear explanation of this.
It has been a little while since the last post, so I apologize that I might not have the most precise answer (I am also not a developer of the program).
I can see that I was able to reproduce the calculation under certain conditions.
You don't upload a background list. If you added a filler value for a z-score, then I think 0 would make more sense than 1. At least that would be my guess.
Off the top of my head, I apologize that I don't remember where that confirmation was saved. It looks like I did in fact use the demo example, but saving the specific numbers would have made it easier to answer this question later. For example, I would have to look into this more in order to remember where the z-score is coming from. I see something about a "precomputed" value, and I remember some of the details being spread out. However, I don't remember exactly where this was coming from at this time.
Categories with 0 overlap are not provided within the results, so you don't have to worry about that part.
UPDATE: I am not sure if this is what I meant before, but I tested trying to calculate what the z-score should be (as the combined score / ln(unadjusted p-value)).
I noticed that this matches the negative odds-ratio. The numbers are so close that I don't think it is a coincidence. This is a bit different than I expected for a z-score, but I don't think this should be an "NA" value (since filtered results are returned).
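The back-calculation above can be sketched roughly as follows. The function names, the tolerance, and the numbers are hypothetical and chosen by me for illustration; with a real table you would plug in the values from the downloaded columns instead.

```python
import math


def implied_z(combined: float, p_value: float) -> float:
    """Back-calculate z as combined score / ln(unadjusted p-value)."""
    if not 0.0 < p_value < 1.0:
        # ln(1) is 0, so p = 1 would mean dividing by zero.
        raise ValueError("p_value must be strictly between 0 and 1")
    return combined / math.log(p_value)


def matches_negative_odds_ratio(combined: float, p_value: float,
                                odds_ratio: float,
                                rel_tol: float = 1e-3) -> bool:
    """Check whether the implied z is close to the negative odds ratio."""
    return math.isclose(implied_z(combined, p_value), -odds_ratio,
                        rel_tol=rel_tol)


# Made-up row: the combined score is constructed as -ln(p) * odds ratio,
# so the check passes by construction.
p = 0.01
odds = 4.0
combined = -math.log(p) * odds
print(implied_z(combined, p))
print(matches_negative_odds_ratio(combined, p, odds))
```

The relative tolerance is only there because table values are rounded; how loose it needs to be is a guess on my part.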
Do you think this might answer your question?
Thanks for brainstorming about it =) Well, the reason why I am asking is that I am trying to translate the concept of combined score on an enrichment analysis obtained from Ingenuity Pathway Analysis, which is more convenient to me but doesn't provide the c-score, so I need to make the calculation by myself. I would turn z-score=NA into 1s because in this way I can keep the information about enrichment, acknowledging there's no information about its direction.
You probably already realize this, but the z-score represents something different in IPA:
http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf
I think there was some simplified version of the IPA z-score that I could calculate to (approximately) verify, but I think the full information needed to re-calculate that might not be available for you.
Also, perhaps most importantly, the IPA z-score looks at the expected direction of change, and that is not directly considered by Enrichr (although a subset of gene sets might only contain genes with one direction of change).
In IPA, I typically look at both the z-score and p-value as independent methods. Sometimes I think one strategy works better and sometimes I think the other strategy works better. I am not sure how some sort of average would turn out, but varying the focus with different projects is what I would do.
So, if you included something, I think a non-significant IPA z-score should be 0 rather than 1. However, I omit the samples missing z-scores when I use that as the sorting method.
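To make the two options concrete, here is a small sketch of both: dropping rows with a missing z-score, or replacing the missing value with a filler such as 0. The row layout, the regulator names, and the choice to sort by the absolute z-score are all my own assumptions for illustration; this is not how IPA exports or sorts its tables.

```python
from typing import List, Optional, Tuple

# Hypothetical row layout: (term name, z-score or None when missing).
Row = Tuple[str, Optional[float]]


def rank_by_z(rows: List[Row],
              filler: Optional[float] = None) -> List[Tuple[str, float]]:
    """Sort rows by absolute z-score, largest first.

    If filler is None, rows with a missing z-score are omitted.
    Otherwise the missing z-score is replaced by the filler value.
    """
    if filler is None:
        kept = [(term, z) for term, z in rows if z is not None]
    else:
        kept = [(term, filler if z is None else z) for term, z in rows]
    return sorted(kept, key=lambda row: abs(row[1]), reverse=True)


# Made-up example rows.
rows: List[Row] = [
    ("Regulator A", 2.4),
    ("Regulator B", None),
    ("Regulator C", -3.1),
]

print(rank_by_z(rows))              # missing z-scores omitted
print(rank_by_z(rows, filler=0.0))  # missing z-scores treated as 0
```

With a filler of 0, a product like ln(p) * z also becomes 0 for those rows, whereas a filler of 1 would turn it into ln(p) alone, so the choice changes where those rows end up in the ranking.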
For the subset of public gene sets with a consistent direction, I have used BD-Func to compare those:
https://sourceforge.net/projects/bdfunc/
https://peerj.com/articles/159/
That BD-Func score is often just a t-test statistic between the two gene sets ("activated" versus "inhibited"), so you might not necessarily need to use that same interface. However, if either the program or the supplementary files are helpful (as an open-source option), then I would certainly be happy to see you use them.
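As a rough illustration of that kind of score, here is a sketch of a t-statistic comparing expression values for "activated" versus "inhibited" genes within one sample. This is not the BD-Func code. The Welch (unequal variance) form, the function name, and the expression values are my own assumptions for the example.

```python
import math
import statistics
from typing import Sequence


def welch_t_statistic(activated: Sequence[float],
                      inhibited: Sequence[float]) -> float:
    """Welch t-statistic: activated minus inhibited, unequal variances."""
    if len(activated) < 2 or len(inhibited) < 2:
        raise ValueError("each gene set needs at least two values")
    mean_diff = statistics.mean(activated) - statistics.mean(inhibited)
    # statistics.variance is the sample variance (n - 1 denominator).
    std_err = math.sqrt(statistics.variance(activated) / len(activated)
                        + statistics.variance(inhibited) / len(inhibited))
    if std_err == 0.0:
        raise ValueError("zero variance in both gene sets")
    return mean_diff / std_err


# Made-up expression values for one sample.
activated_expr = [2.0, 3.0, 4.0]
inhibited_expr = [0.0, 1.0, 2.0]
print(welch_t_statistic(activated_expr, inhibited_expr))
```

A positive value here would mean the "activated" genes are higher on average than the "inhibited" genes in that sample, which is the direction information that a p-value from an overlap test alone does not give you.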
I realize this is a little different (you are trying to revise the IPA analysis, rather than use an open-source alternative to the IPA z-score). However, these are my thoughts, and I hope they can be helpful.