Imagine that you have two strings of approx. equal length. You can perform an almost exact match on both while allowing for mismatches.
GCGTGAACGCATTTGCCGGTGATCGT
****** * *
GCGTGAGACGCATTTGCCGGTGATCG
Now imagine that you could insert a insertion in either one of them and have an even better match.
GCGTGAACG-CATTTGCCGGTGATCGT
********* *****************
GCGTGAGACGCATTTGCCGGTGATCGT
How would you go about computing this extremely efficiently ? It's not a Smith-Waterman with a restricted search space since SW does not "know" what came before (in terms of operations) in the search upon computation of the next square in the matrix. If I take an insert once, I want it to be forbidden to take another in either string.
edit: I do not want seeding heuristics, I want an exact answer. edit2: I want the global alignment, not a local one
Can you clarify whether you want a global (end-to-end) alignment between the strings, or a local alignment? Do you want to allow mismatches as well as the single indel?
done, thanks for the comment !
What about allowing mismatches as well as the indel? If you require an exact match of the regions flanking the indel I believe there is a straightforward solution, but if these regions can have mismatches then it is more complicated.
no, you can have mismatches in those.
I'm missing something .. why not just using a function like the C strcspn ( http://www.cplusplus.com/reference/cstring/strcspn/ ) and compare the two strings a the location of the first mismatch ?
I'm missing something too, how does strcspn solve this ? It will just return the index of an ACGT (since all are in string2) in the first string which is likely to return almost always 0.
sorry that was the wrong C function. I was think of a C function that would return the index of the first occurence of difference between two strings. e.g: http://www.cplusplus.com/reference/string/string/find_first_not_of/
still, that is exact matching. How do you handle gaps ? How do you know where the gap occurs ? You check all positions ? It's more an algorithmic question rather than a programmatic one.