I have three lists of sequences that I need to compare to each other, finding where they completely match. For example, if I have AGCTATGCTCTATCGTACGT
, TTAGCTATGCTCTATCGT
, and AGCTATGCTCTATCGTACGTTTCA
then they would match as
AGCTATGCTCTATCGTACGT
TTAGCTATGCTCTATCGT
AGCTATGCTCTATCGTACGTTTCA
They match in whole segments, no mismatches or gaps, but they can have extra letters at the start or end of the matching string. I could go ahead and write my own algorithm to do this, but I'm sure there's already an algorithm out there that handles this. I'm aware of algorithms such as this, but that takes gaps and mismatches into account, which sounds like more computation than is needed.
Anyone know of such an algorithm, or something that can point me in the right direction?
UPDATE: At first I thought Philippe had the answer I was looking for (especially with a link to working Python code), but then I realized the algorithm isn't quite what I need. For example, the longest-common substring of abba
and cbbc
returns bb
, where it should return null because c
doesn't match with a
.
RESOLVED: I decided to go and make my own algorithm. It's not the longest common substring, but it pointed me in the right direction, and runs in roughly the same amount of time.