Hi everyone,
I am trying to compare sequences in a pairwise fashion to obtain percent similarity scores. For context, these sequences are non-coding DNA that arose through tandem duplication events. Because these sequences are not under selection pressure, my hypothesis is that the similarity between two copies reflects how recently that duplication occurred relative to other duplications.
For unrelated sequences, I expect a percent similarity score of ~25% (random matches) and increasingly higher scores for sequences that have duplicated recently.
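A minimal sketch of where the ~25% baseline comes from, assuming equal base frequencies and a simple position-by-position comparison with no gaps. The helper names (`random_dna`, `ungapped_identity`) and the sequence length and replicate count are illustrative choices, not taken from any real dataset.

```python
import random


def random_dna(length, rng):
    """Return a random DNA string with equal A/C/G/T frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def ungapped_identity(a, b):
    """Percent of matching positions over the shared length, with no gaps allowed."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


rng = random.Random(0)  # fixed seed so the run is repeatable
scores = [
    ungapped_identity(random_dna(1000, rng), random_dna(1000, rng))
    for _ in range(200)
]
mean_identity = sum(scores) / len(scores)
print(f"mean ungapped identity of unrelated sequences: {mean_identity:.1f}%")
```

With skewed base composition (for example AT-rich regions) the chance-level identity would be higher than 25%, so the baseline may need to be estimated from sequences with matching composition.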
To do this, I have tried both global and local pairwise alignment algorithms (i.e., Needleman-Wunsch and Smith-Waterman). Global alignment does not seem appropriate for my dataset because my sequences differ in length, and the resulting end gaps reduce the similarity scores significantly. Local alignment avoids the end-gap problem but produces inflated scores (~40%) even for unrelated sequences, presumably because it reports only the best-matching region. For unrelated sequences of similar lengths, I have obtained the expected similarity scores (25-35%) by maximizing end gap penalties using Needleman-Wunsch.
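The end-gap effect described above can be illustrated with a minimal pure-Python Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, linear gap -2), the function names, and the toy sequences are all assumptions chosen for illustration; a real analysis would more likely use an established aligner with affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns the two aligned rows."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # leading gaps are penalised like any other gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def percent_identity(row_a, row_b):
    """Matching columns divided by all alignment columns, gaps included."""
    if not row_a:
        return 0.0
    matches = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return 100.0 * matches / len(row_a)


# Toy case: the short sequence is an exact copy of the middle of the long one.
short_seq = "ACGTACGT"
long_seq = "TTTTTTTT" + short_seq + "GGGGGGGG"
aligned_short, aligned_long = needleman_wunsch(short_seq, long_seq)
identity = percent_identity(aligned_short, aligned_long)
print(aligned_short)
print(aligned_long)
print(f"identity over all columns: {identity:.1f}%")
```

In this toy case all 8 bases of the short sequence match, yet the identity over the full alignment is 8 matches out of 24 columns, because the 16 gap columns count in the denominator. Which denominator to use (full alignment length, shorter sequence length, or aligned columns only) is a design choice that changes the reported percentage considerably.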
Question: Is my approach sound and appropriate? Is pairwise alignment the right way to do this comparison, or am I relying on the wrong method entirely? A sequence dot plot is probably closest to providing the information I want, but it does not give a numerical value (percent identity) that I can use directly to compare pairs of sequences.
Thank you very much for reading and please let me know if I can provide any additional details to clarify the question.