I am working with a gene prioritization program and I want to analyze its performance using ROC curves and the ROCR package for R. The problem is that the program does not give me a score for each gene; it only orders the genes.
ROCR only uses continuous data as predictions. Is it possible to assign to each gene a number in descending order? For example, I have these genes ordered:
- Gene A
- Gene B
- Gene C
Could I assign these values?
- Gene A: 3
- Gene B: 2
- Gene C: 1
That's odd that the program doesn't give you a numerical value. Without one, you have no way of knowing whether two genes were tied in the prioritization, so you will likely have to assume there are no ties. I think the most explainable approach is what you propose: construct the ROC curve from the rank of each gene in the list, with the top-ranked gene getting the highest value. The absolute value of a score doesn't affect the ROC curve, which depends only on the ordering; you could multiply all scores by some positive factor X and still get the same curve.
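To make the point about ranks concrete, here is a minimal sketch in plain Python (not ROCR, just the standard library, so it is easy to run anywhere). The gene names beyond A to C, the set of "true" genes, and the helper names `rank_scores`, `roc_points` and `auc` are all made up for illustration. It assumes no ties and at least one positive and one negative gene.

```python
def rank_scores(ordered_genes):
    """Turn an ordered list into scores: the top gene gets n, the last gets 1."""
    n = len(ordered_genes)
    return {gene: n - i for i, gene in enumerate(ordered_genes)}


def roc_points(scores, positives):
    """Return (FPR, TPR) points, lowering the threshold one gene at a time.

    Assumes no tied scores and at least one positive and one negative gene.
    """
    n_pos = sum(1 for gene in scores if gene in positives)
    n_neg = len(scores) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for gene in sorted(scores, key=scores.get, reverse=True):
        if gene in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ranking and hypothetical known disease genes.
ranking = ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E"]
true_genes = {"Gene A", "Gene C"}

scores = rank_scores(ranking)   # Gene A: 5, Gene B: 4, ..., Gene E: 1
curve = roc_points(scores, true_genes)

# Multiplying every score by a positive factor leaves the ordering,
# and therefore the ROC curve, unchanged.
scaled = {gene: 10.0 * s for gene, s in scores.items()}
assert roc_points(scaled, true_genes) == curve

print(curve)
print(auc(curve))
```

In R, the same rank-derived values are what you would hand to ROCR as the predictions, alongside your 0/1 labels for the known genes.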
Could you tell us what the gene prioritization method is? I would double-check that the method really does not give you a numerical score, and that it is not doing a gene set analysis (where the order of the genes has no meaning) as opposed to prioritization.
The program is DADA (http://compbio.case.edu/omics/software/dada/) and I think it does a real gene prioritization because it ranks your candidate genes file. Moreover, seed genes are usually at the top of the list. But you can take a look, I would be very grateful.
Yep, it is a network-based gene prioritization approach. Based on their paper, it does create a score that it uses to rank the genes. However, according to their user documentation, that score is not provided as an output. So from a practical perspective, like Jean-Karim noted, you likely have to use the rank of the gene as a score so that packages like ROCR will generate a ROC curve for you.
Perfect, thanks! Could you recommend any gene prioritization tool that provides scores in its output? I am using DADA because it allows you to load a protein-protein interaction network and is easy to use. It works well, but it would be useful to try other programs.