Question

Trying To Find Appropriate Algorithm, Similar To String Alignment

6

Entering edit mode

13.5 years ago

Jesse J ▴ 150

I have three lists of sequences that I need to compare to each other, finding where they completely match. For example, if I have AGCTATGCTCTATCGTACGT, TTAGCTATGCTCTATCGT, and AGCTATGCTCTATCGTACGTTTCA then they would match as

    AGCTATGCTCTATCGTACGT
  TTAGCTATGCTCTATCGT
    AGCTATGCTCTATCGTACGTTTCA

They match in whole segments, no mismatches or gaps, but they can have extra letters at the start or end of the matching string. I could go ahead and write my own algorithm to do this, but I'm sure there's already an algorithm out there that handles this. I'm aware of algorithms such as this, but that takes gaps and mismatches into account, which sounds like more computation than is needed.

Anyone know of such an algorithm, or something that can point me in the right direction?

UPDATE: At first I thought Philippe had the answer I was looking for (especially with a link to working Python code), but then I realized the algorithm isn't quite what I need. For example, the longest-common substring of abba and cbbc returns bb, where it should return null because c doesn't match with a.

RESOLVED: I decided to go and make my own algorithm. It's not the longest common substring, but it pointed me in the right direction, and runs in roughly the same amount of time.

sequence • 3.7k views

ADD COMMENT • link updated 13.5 years ago by Jeremy Leipzig 23k • written 13.5 years ago by Jesse J ▴ 150

score 7 · Answer 1 · 2011-11-11

Unless computational performance is critical for your application (it doesn't sound like it is), then why not just use any old multiple sequence alignment program? Set your gap penalties to be high so the algorithm doesn't chase down dead ends and that will probably minimize the added overhead. In any case, I would guess that would be faster than trying to track down an ungapped MSA program. On the scale of things, you have a very simple problem so I'd guess the computation wouldn't take long anyway.

score 3 · Answer 2 · 2011-11-11

3

Entering edit mode

13.5 years ago

Martin A Hansen 3.0k

This sounds like a problem for Uclust.

ADD COMMENT • link 13.5 years ago by Martin A Hansen 3.0k

score 2 · Answer 3 · 2011-11-11

What you try to do is to define the Longest Common Substring, there are several implementations of such algorithms available but there are generally restricted to pairwise comparisons.

The problem seems to be generalized to more than two sequences and is then referred as the k-common substring problem. You can find several articles or resources googling this term but that was starting to get too technical for me. =)

score 2 · Answer 4 · 2011-11-11

2

Entering edit mode

13.5 years ago

Jeremy Leipzig 23k

yes this is clustering

http://www.vmatch.de/

ADD COMMENT • link 13.5 years ago by Jeremy Leipzig 23k