Enrichment Analysis Based On Quantitative Annotation Scores
3
6
13.1 years ago
Andrew Su 4.9k

Annotation enrichment analyses (like GSEA) are of course very common these days for the analysis of genome-scale data. However, they typically are based on qualitative (and absolute) gene annotations. For example, the gene CDK2 is involved in cell cycle with no ambiguity or uncertainty.

Is anyone aware of enrichment approaches that are based on quantitative confidence scores? Such a method would be able to intelligently use data that said CDK2 is involved in cell cycle with >> 99% certainty, whereas TBL1X is associated with autism at only 25% confidence.

Ignore for the moment where exactly those confidence scores might come from, but one can imagine that they might come from some text mining process. Any thoughts or leads?

enrichment statistics • 3.2k views
2
13.1 years ago

Some of these terms don't seem to lend themselves to quantification: a gene is either involved in cell cycle or not (and of course it all depends on what 'involved' means). A number next to this would at best represent the information/knowledge available to the individual making the statement, but how could that be a generic concept that applies to everyone?

Perhaps it is the word "involved" that serves as the actual quantification. When people have only a hunch (10-25% certainty), they call it "involved"; at 25-50% we are in "associated" territory; and higher still we get into the stronger terms implying causation.

0

Though I'm not sure I agree that "involved" and "associated" differ by a level of confidence, your point is well taken about how one would interpret a quantitative score. This perhaps is a limitation of trying to boil a rich picture of biological knowledge down to a structured gene annotation...

2
13.1 years ago

I am not aware of any methods that assign a quantitative value to the likelihood that a gene is assigned to a GO term. I think the reason this is the case is that tool developers assume that uncertainty should be quantified at the set level, not the gene level.

The only work that I am aware of where a quantitative value is assigned to genes in GO analysis is on methods that attempt to correct for non-uniformity in genomic locus length when assigning genomic features to GO categories:

These are not exactly what you are looking for, but they might be a starting point to think about assigning a variable weight to genes in GO analyses.

Picking up on @Istvan's comment, even if you did have a method, where would these values come from? One place to look is in the inherent uncertainty reported in the literature in terms of contradictions. For example, you could assign the ratio of positive to negative mentions of a gene-GO link using contradiction mining, as has been recently done in the bioNOT system: http://www.ncbi.nlm.nih.gov/pubmed/22032181

0

Great references, thanks. You're right that they aren't directly applicable, but necessary reading if we end up trying to implement something. Regarding your last question/comment, yes, text mining is a natural place where one might end up with confidence scores...

1
13.1 years ago
Qdjm 1.9k

Three thoughts:

  1. If you want to test for enrichment of an annotation in a gene list (e.g. Fisher's exact test) and if your confidence measures can be interpreted as probabilities that the annotation is correct, you could sum these probabilities up for all the genes in the gene list and compare this sum to what you would expect from a random subset of the same size from the background set. I bet that the null distribution is normal and a Chi-squared test is appropriate but you'd have to check both these guesses. The interpretation of this is that the sum of the gene list probabilities is the "expected number of genes with that annotation" in the gene list.

  2. Again, if your confidence measures are probabilities, you can sample annotations (i.e. give TBL1X an autism annotation with probability 25%) and then apply your favourite enrichment test. Then resample and redo it, etc. Not sure how to combine the P-values -- maybe take the median? -- but to be rigorous you should calculate the null distribution of the P-values through randomization.

  3. If the confidence measures are not probabilities, you could try calculating a Spearman correlation between the quantitative annotation score and a quantitative measurement associated with each gene.
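
Ideas 1 and 2 above can be sketched roughly as follows. This is only a minimal illustration under assumptions that are not in the post: the names `expected_count_test`, `hypergeom_upper_tail` and `sampled_annotation_pvalues` are made up, the confidence scores are treated as probabilities stored in a dict covering the whole background set, and a permutation null stands in for the normal / Chi-squared guess in idea 1 (which would still need checking). A one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) plays the role of "your favourite enrichment test" in idea 2.

```python
import random
from math import comb


def expected_count_test(probs, gene_list, n_perm=10000, seed=0):
    """Idea 1: compare the summed annotation probabilities of the gene list
    with sums from random gene sets of the same size.

    probs     -- dict: gene -> probability the annotation is correct
                 (hypothetical input; keys define the background set)
    gene_list -- genes of interest, all present in probs
    Returns (expected number of annotated genes, permutation p-value).
    """
    rng = random.Random(seed)
    background = list(probs)
    size = len(gene_list)
    observed = sum(probs[g] for g in gene_list)
    hits = 0
    for _ in range(n_perm):
        null_sum = sum(probs[g] for g in rng.sample(background, size))
        if null_sum >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when drawing n genes from N, of which K are annotated."""
    total = sum(comb(K, i) * comb(N - K, n - i)
                for i in range(k, min(K, n) + 1))
    return total / comb(N, n)


def sampled_annotation_pvalues(probs, gene_list, n_samples=200, seed=0):
    """Idea 2: repeatedly sample a hard annotation set from the
    probabilities and run an ordinary enrichment test on each sample."""
    rng = random.Random(seed)
    genes_of_interest = set(gene_list)
    N, n = len(probs), len(genes_of_interest)
    pvals = []
    for _ in range(n_samples):
        annotated = {g for g, p in probs.items() if rng.random() < p}
        k = len(annotated & genes_of_interest)
        pvals.append(hypergeom_upper_tail(k, len(annotated), n, N))
    return pvals
```

The list returned by `sampled_annotation_pvalues` still has to be summarised somehow (the median is the suggestion above), and that summary would itself need a randomization null to be interpreted as a p-value.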

0

Nice ideas, thanks qdjm... I like #1 the best -- that was closest to what we were thinking. Another variant of that is to do a GSEA-like approach in which one tests enrichment at every possible probability threshold and then takes the most significant p-value. Anyway, thoughts much appreciated...
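
A rough sketch of the threshold-scanning variant, for illustration only. The function names and the dict-of-probabilities input are assumptions, and a one-sided hypergeometric test is used as the per-threshold enrichment test. Because the minimum p-value over many thresholds is optimistically biased, the sketch also recalibrates it against random gene lists of the same size.

```python
import random
from math import comb


def upper_tail(k, K, n, N):
    """One-sided hypergeometric P(X >= k): n draws, K annotated among N."""
    total = sum(comb(K, i) * comb(N - K, n - i)
                for i in range(k, min(K, n) + 1))
    return total / comb(N, n)


def min_p_over_thresholds(probs, gene_list):
    """Test enrichment at every distinct probability threshold and
    return the smallest (most significant) p-value."""
    genes_of_interest = set(gene_list)
    N, n = len(probs), len(genes_of_interest)
    best = 1.0
    for t in sorted(set(probs.values())):
        # Genes whose annotation confidence is at least the threshold.
        annotated = {g for g, p in probs.items() if p >= t}
        k = len(annotated & genes_of_interest)
        best = min(best, upper_tail(k, len(annotated), n, N))
    return best


def calibrated_p(probs, gene_list, n_perm=1000, seed=0):
    """Recalibrate the minimum p-value by permutation, since scanning
    thresholds and keeping the best result inflates significance."""
    rng = random.Random(seed)
    background = list(probs)
    n = len(set(gene_list))
    observed = min_p_over_thresholds(probs, gene_list)
    hits = sum(
        min_p_over_thresholds(probs, rng.sample(background, n)) <= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)
```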

0

Hi Andrew, thanks for the comment! Be careful with thresholding because it can give you different answers than sampling or taking expectations if your low- and high-confidence annotations are from different populations of genes. You could also use an approach like #1 for GSEA: I think you just need to multiply the summands for P_hit (eqn 1 in the appendix of the PNAS 2005 paper) by the probability for gene j, do the same for N_r, and let N_h be the sum of all probabilities. Someone must have tried this already. Anyways, happy to chat about this offline.
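
One way the probability-weighted GSEA running sum might look, as a sketch rather than a tested method. The function name `probabilistic_es` and its inputs are hypothetical. The hit term follows the description above (each summand of P_hit and of N_r multiplied by the gene's probability, N_h taken as the sum of probabilities); weighting each miss step by (1 - probability) is an extra assumption of this sketch, chosen so that the statistic reduces to the usual GSEA enrichment score when every probability is 0 or 1.

```python
def probabilistic_es(ranked_genes, scores, probs, p=1.0):
    """Enrichment score with soft gene-set membership.

    ranked_genes -- genes ordered by decreasing correlation with phenotype
    scores       -- dict: gene -> ranking metric r_j
    probs        -- dict: gene -> probability of belonging to the set
                    (genes missing from probs are treated as probability 0)
    p            -- exponent applied to |r_j|, as in standard GSEA
    """
    q = [probs.get(g, 0.0) for g in ranked_genes]
    w = [abs(scores[g]) ** p for g in ranked_genes]
    n = len(ranked_genes)
    n_r = sum(qj * wj for qj, wj in zip(q, w))  # probability-weighted N_r
    n_h = sum(q)                                # expected set size, N_h
    if n_r == 0 or n_h >= n:
        raise ValueError("gene set has no weight or covers every gene")

    running, es = 0.0, 0.0
    for qj, wj in zip(q, w):
        # Step up by the hit share, down by the (assumed) miss share.
        running += qj * wj / n_r - (1.0 - qj) / (n - n_h)
        if abs(running) > abs(es):
            es = running
    return es
```

As in ordinary GSEA, significance would still have to come from a permutation null (for example by shuffling phenotype labels or gene ranks); that step is not shown here.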
