Hi all,
I'm trying to figure out a way to identify pairs of short sequences that have a perfect overlap.
One sequence can be fully contained in the other, such as:
- AAACCCTTTGGG
- CCTTTG
They can also be overlapping, such as:
- AAACCCTTTGGG
- TTTGGGTCGA
I want to distinguish these scenarios from cases where the match is not perfect (a single gap or mismatch must immediately disqualify the pairwise comparison). Basically, I need to know whether both sequences could stem from the same template; they don't need to come from the same position on the template.
I've been playing around with the penalty parameters of pairwise2, but I couldn't find a way that would allow me to write an if/else statement to automatically decide whether the sequences have a perfect overlap or not. The sequences differ in length, and also the overlapping regions differ, so I cannot just set a constant score threshold.
It would be great if someone could help me out here. I'm sure this is an easy exercise for many of you.
Best and many thanks in advance! Gero
So, are you interested in retaining pairs where there is a mismatch (exactly one, or one or more)? If you are only interested in perfect matches, this will be quite a lot easier.
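For the perfect-match-only case, here is a minimal sketch using plain string operations rather than pairwise2. The function names and the `min_overlap` parameter are my own assumptions, not something from the question; a minimum overlap length is probably needed, because otherwise two sequences that merely share a single base at their ends would count as overlapping. Reverse complements are not considered here.

```python
def suffix_prefix_overlap(a, b, min_overlap):
    """Return the length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_overlap bases exists.
    """
    # Try the longest possible overlap first and shrink from there.
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def has_perfect_overlap(a, b, min_overlap=4):
    """True if a and b are consistent with the same template without gaps or mismatches.

    min_overlap is an assumed threshold; pick a value that suits your data.
    """
    # Case 1: one sequence is fully contained in the other.
    if a in b or b in a:
        return True
    # Case 2: the end of one sequence exactly matches the start of the other.
    return (suffix_prefix_overlap(a, b, min_overlap) > 0
            or suffix_prefix_overlap(b, a, min_overlap) > 0)


print(has_perfect_overlap("AAACCCTTTGGG", "CCTTTG"))      # containment example
print(has_perfect_overlap("AAACCCTTTGGG", "TTTGGGTCGA"))  # end overlap example
print(has_perfect_overlap("AAACCCTTTGGG", "TTTGCGTCGA"))  # one mismatch
```

Because the comparison is exact string matching, any gap or mismatch fails the test, so no score threshold is needed and the result drops straight into an if/else.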