First, a quick summary of sensitivity, specificity and accuracy; then some thoughts as to whether they apply in your case. I've linked here to Wikipedia entries, which are very good summaries of this topic.
To calculate sensitivity, specificity and accuracy, you require 4 things:
- True positives (TP). A true positive is an observation, identified by your algorithm, which is a real instance of a feature.
- False positives (FP). A false positive is when your algorithm identifies an observation as a real instance of the feature but in fact, it is not.
- True negatives (TN). A true negative is the case where the observation is not a real instance of the feature and your algorithm identifies it as such.
- False negatives (FN). A false negative is the case where your algorithm does not identify the observation as a real instance of the feature but in fact, it is.
Sensitivity is then:
TP / (TP + FN)
And specificity is:
TN / (TN + FP)
Accuracy does not have a single, concise definition - it's measured in many ways, using various combinations of TP, FP, TN and FN. One measure is:
TP + TN / (TP + FP + TN + FN)
Another metric is positive predictive value, defined as:
TP / (TP + FP)
Your problem then, is to define what constitutes TP, FP, TN and FN. You can see that this is difficult to do when the algorithm is fuzzy pattern matching between strings. A perfect match to a known motif might be a TP, but what about a partial match - is that a TP? Similarly, what is a TN in this case? No matches when a known motif is not present? Clearly, the definition of a match changes depending on the query and target sequences. So I don't think that specificity and sensitivity are very useful measures in your case.
Since what you are doing sounds very similar to BLAST (and other pattern matching algorithms), I suggest you consult the literature and see how performance of those algorithms was evaluated.
Do you mean like BLAST?
do you have a gold standard to evaluate against?
Do you mean like grep/agrep? ;)
@niallhaslam: Similar to Blast, yes. @Casey: No, I'm actually looking for something like that, if it exists.
Mike, if you say similar to blast (a local alig), do you want do a sequence alignment? Note that alignment is different from pattern matching, please specify what you are going to achieve correctly, otherwise you will fail with your implementation attempt anyway. If dealing with global/local-alignments, then your 'gold-standard' is Needleman-Wunsh (global) or Smith-Waterman(local).
Please read this first http://en.wikipedia.org/wiki/String_searching_algorithm vs. http://en.wikipedia.org/wiki/Sequence_alignment
Thanks for the comments, Michael. My end goal is local alignment. I'm aiming to use the fuzzy matching to essentially give me the initial locations where I can use some method to find good local alignments. (Local DP comes to mind.)
Also, Michael, I'm concerned about S-W as a "gold standard". Yes it is optimal, but I'd like to run the algorithm on a long (perhaps genome-size) text and small query.
I do hope to have faster performance (although less accurate) with my algorithm though.