I've got a lot of genetic sequences and I'd like to use various clustering algorithms on them, but this requires a measure of the distance between two sequences. I've used BLAST+ and the Needleman-Wunsch algorithm, but both give measures of similarity rather than distance (i.e. two similar sequences get a high similarity score, whereas I need them to have a low distance). I've found methods for whole genomes, but here I just want it for pairs of genes.
Is there a good way to get distance from similarity score? Ideally I'd like something where two identical sequences have a distance of zero and distance from sequence A to sequence B is the same as B to A.
Or is there some other method that finds a distance directly from the sequences (without first computing similarity)?
I can think of various ways to turn similarity scores into something a bit like a distance (e.g. D(A, B) = 1 / sim(A, B), or D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1), but I'm sure someone must have done this before and have a better solution! All help greatly appreciated.
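To make the two candidate formulas above concrete, here is a minimal Python sketch of what I mean. It is only an illustration: `nw_score` is a bare-bones Needleman-Wunsch scorer with an arbitrary match/mismatch/gap scheme (+1/-1/-1) that I picked for the example, not the scoring BLAST+ uses, and the function names are my own. Returning infinity when the similarity is zero or negative is also just an assumption to avoid dividing by a non-positive number.

```python
import math


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty.

    The scoring values are illustrative assumptions, not a recommended scheme.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def dist_reciprocal(a, b):
    """D(A, B) = 1 / sim(A, B); infinite if the similarity is not positive."""
    s = nw_score(a, b)
    return 1 / s if s > 0 else math.inf


def dist_normalised(a, b):
    """D(A, B) = min(sim(A, A), sim(B, B)) / sim(A, B) - 1."""
    s = nw_score(a, b)
    if s <= 0:
        return math.inf  # assumption: treat non-positive similarity as "maximally far"
    return min(nw_score(a, a), nw_score(b, b)) / s - 1


if __name__ == "__main__":
    print(dist_normalised("ACGT", "ACGT"))  # identical sequences
    print(dist_normalised("ACGT", "AGT"), dist_normalised("AGT", "ACGT"))
    print(dist_reciprocal("ACGT", "ACGT"))
```

With this toy scoring, the second formula gives zero for identical sequences and is symmetric, which is what I'm after. The plain reciprocal is symmetric too, but it never reaches zero for identical sequences, and neither formula copes well when the similarity score is zero or negative.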