My goal is to evaluate the Sensitivity/Specificity of an indel detection method.
I have a "gold standard" VCF file (ref.vcf) that states exactly where the insertions and deletions are in my genome. And of course, my indel detection method produces its own VCF file (let's call it test.vcf).
To calculate the True Positives, I take the intersection of test.vcf and ref.vcf (I use exact intersection for the sake of simplicity for now). The False Positives are the features in test.vcf that are not in ref.vcf. And the False Negatives are the features in ref.vcf that are not in test.vcf.
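For what it's worth, the exact-intersection bookkeeping can be sketched like this. It keys each indel by (CHROM, POS, REF, ALT); the toy records below are made up, and with real files you would pass `open("ref.vcf")` / `open("test.vcf")` instead of the inline lists:

```python
# Sketch of the exact-match TP/FP/FN counting described above.
# Each indel call is keyed by (chromosome, position, ref allele, alt allele).

def load_indel_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if len(ref) != len(alt):  # keep only indels
            keys.add((chrom, pos, ref, alt))
    return keys

# Toy data standing in for ref.vcf and test.vcf:
ref_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tAT",   # insertion
    "chr1\t200\t.\tGC\tG",   # deletion
]
test_lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tAT",   # matches ref -> TP
    "chr1\t300\t.\tT\tTA",   # not in ref  -> FP
]

ref_calls = load_indel_keys(ref_lines)
test_calls = load_indel_keys(test_lines)

tp = len(test_calls & ref_calls)   # in both
fp = len(test_calls - ref_calls)   # only in test.vcf
fn = len(ref_calls - test_calls)   # only in ref.vcf
print(tp, fp, fn)
```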
But how would you calculate the True Negatives? I just can't use the number of positions left (too big number!).
Why is the number too big? From my understanding, you have a number of positions that say "nope, no indel here," which is probably the majority of them. For these positions, if there really isn't an indel there, shouldn't that be a true negative? Assuming similar data, you should have mostly true negatives.
Pascal is correct: the total number of TN is so large (~3.3e9 positions for human) that the resulting figures are misleading (the FP count drowns in rounding error). It is therefore common practice not to use the standard definition of specificity here.
You can use the Positive Predictive Value (thanks Casey for clearing the definition up):
PPV = TP/(TP + FP)
instead of the Specificity:
Sp = TN/(TN+FP)
This has been used in eukaryotic gene prediction, where you face a similar situation: if you look for coding regions on a per-nucleotide basis, the vast majority of the genome is non-coding. PPV has the advantage of avoiding the extremely large TN values, which push Sp close to 1 in almost every case.
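A quick numerical illustration of this point (all counts are made-up toy numbers): with billions of negative positions, even a large number of false positives barely moves Sp, while PPV reacts immediately.

```python
# Toy confusion-matrix counts; TN is roughly "every other genome position".
tp, fp = 9_000, 1_000
tn = 3_300_000_000 - tp - fp

ppv = tp / (tp + fp)   # positive predictive value
sp = tn / (tn + fp)    # classic specificity

print(f"PPV = {ppv:.3f}")   # informative
print(f"Sp  = {sp:.10f}")   # saturated near 1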
As you probably know, the genome-wide "specificity" value you refer to is more properly called positive predictive value (PPV http://en.wikipedia.org/wiki/Positive_predictive_value). The (mis)use of the term specificity for PPV causes no end of confusion among students (and researchers). I've found it is better to avoid using the terms sensitivity/specificity and use recall/precision instead, since they are not ambiguously defined.
So? Actually I found this definition in the book 'Zvelebil, Understanding Bioinformatics'. I will look this definition up tomorrow and see whether they got it right, whether they are themselves a source of confusion, or whether I am. And 'causes no end of confusion', now you are a bit exaggerating, aren't you? But I will correct it and call it PPV then.
The reason this is important is that terms must be precise to have meaning. I wouldn't be surprised if Zvelebil got this wrong; it happens in many places. FYI, see wikipedia for the formal classification of performance-related terms: http://en.wikipedia.org/wiki/Sensitivity_and_specificity#Worked_example
I have now checked the text in the textbook "Understanding Bioinformatics", 1st edition (2007, maybe corrected by now?), by Zvelebil & Baum. On p. 365 they use the exact misnomer I was reproducing: they propose PPV and introduce it as specificity, while mentioning a standard definition of specificity (same as Sp in my text) without giving references.
I agree with the comment above, that number really is your True Negative count. And yeah, it will be an absurdly large number depending on your dataset. What you will want to do is look beyond simply calculating sensitivity and specificity. In cases where you have an unbalanced number of entries per class (indel vs. no-indel in this case), you want to start looking at something like the F1-score or the Matthews Correlation Coefficient as a better summary statistic for your comparisons.
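Both are straightforward to compute from the four confusion-matrix counts; here is a minimal sketch (the counts passed in are toy numbers, plug in your own):

```python
import math

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (TN plays no role)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; uses all four counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(f1_score(90, 10, 30))
print(mcc(90, 10, 30, 1_000_000))
```

Note that F1 ignores TN entirely, which sidesteps the huge-TN problem, while MCC uses TN but remains informative on heavily unbalanced classes.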
Another way to analyze the data is to construct ROC or Precision-Recall curves, so you can see how specificity and sensitivity trade off against one another.
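If your caller emits a confidence score (e.g. the VCF QUAL field), a Precision-Recall curve is just a sweep over score thresholds. A minimal sketch with made-up (QUAL, is-a-true-indel) pairs; in practice the truth labels come from matching against ref.vcf:

```python
# Toy list of calls: (QUAL score, whether the call is a true indel).
calls = [
    (50, True), (45, True), (40, False), (35, True),
    (30, False), (25, True), (20, False),
]
n_true = sum(1 for _, label in calls if label)

# One (precision, recall) point per threshold, highest threshold first.
curve = []
for qual, _ in sorted(calls, reverse=True):
    kept = [label for q, label in calls if q >= qual]
    tp = sum(kept)
    precision = tp / len(kept)
    recall = tp / n_true
    curve.append((qual, precision, recall))

for qual, p, r in curve:
    print(f"QUAL>={qual}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold trades precision for recall, which is exactly the interaction you want to visualize.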
It is true that other measures may be worth looking at, but it is also possible to keep working with Se and Sp, since there is a way around the large TN counts, and these measures have the advantage of being so well established.