Trying To Find Appropriate Algorithm, Similar To String Alignment
4
6
Entering edit mode
13.0 years ago
Jesse J ▴ 150

I have three lists of sequences that I need to compare to each other, finding where they completely match. For example, if I have AGCTATGCTCTATCGTACGT, TTAGCTATGCTCTATCGT, and AGCTATGCTCTATCGTACGTTTCA then they would match as

    AGCTATGCTCTATCGTACGT
  TTAGCTATGCTCTATCGT
    AGCTATGCTCTATCGTACGTTTCA

They match in whole segments, no mismatches or gaps, but they can have extra letters at the start or end of the matching string. I could go ahead and write my own algorithm to do this, but I'm sure there's already an algorithm out there that handles this. I'm aware of algorithms such as this, but that takes gaps and mismatches into account, which sounds like more computation than is needed.

Anyone know of such an algorithm, or something that can point me in the right direction?

UPDATE: At first I thought Philippe had the answer I was looking for (especially with a link to working Python code), but then I realized the algorithm isn't quite what I need. For example, the longest-common substring of abba and cbbc returns bb, where it should return null because c doesn't match with a.

RESOLVED: I decided to go and make my own algorithm. It's not the longest common substring, but it pointed me in the right direction, and runs in roughly the same amount of time.

sequence • 3.5k views
ADD COMMENT
7
Entering edit mode
13.0 years ago
Andrew Su 4.9k

Unless computational performance is critical for your application (it doesn't sound like it is), then why not just use any old multiple sequence alignment program? Set your gap penalties to be high so the algorithm doesn't chase down dead ends and that will probably minimize the added overhead. In any case, I would guess that would be faster than trying to track down an ungapped MSA program. On the scale of things, you have a very simple problem so I'd guess the computation wouldn't take long anyway.

ADD COMMENT
3
Entering edit mode
13.0 years ago

This sounds like a problem for Uclust.

ADD COMMENT
2
Entering edit mode
13.0 years ago
Philippe ★ 1.9k

What you try to do is to define the Longest Common Substring, there are several implementations of such algorithms available but there are generally restricted to pairwise comparisons.

The problem seems to be generalized to more than two sequences and is then referred as the k-common substring problem. You can find several articles or resources googling this term but that was starting to get too technical for me. =)

ADD COMMENT
2
Entering edit mode
13.0 years ago

yes this is clustering

http://www.vmatch.de/

ADD COMMENT

Login before adding your answer.

Traffic: 2367 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6