A new local alignment algorithm
0
1
Entering edit mode
8.3 years ago

I am a computer scientist who got interested about sequence alignment. I picked the subject by learning the Smith-Waterman algorithm, which I think is a very reasonable first attack on the local sequence alignment problem. However, using certain tricks, I can do better on the average.

I have written a local sequence alignment algorithm with the following properties:

  • It is exact, meaning that if the sequence S1 occurs in the sequence S2, the algorithm will always find it. In other words, I am not using any kind of heuristics or statistics.
  • The worse-case computational complexity is O(nm). However, this happens only for sequences that have a very special structure. I am quite confident one never sees something like that in any real data.
  • The average-case computational complexity is approximately O(n+m). I use the word "approximately" because the math is quite difficult: maybe there's actually some log log n term missing or something. The approximation is based on simulations with input sizes ranging from 10 to 100,000,000.
  • Insertions or deletions are not allowed, only mutations. (This is an obvious limitation compared to SW but I have already an idea how they could be added)
  • As a practical example, if S1 is 1000 characters, and S2 is 100,000,000 characters, it takes about 100 ms to do the alignment on one 1 GHz processor. I think it will be less than 1 ms after some code optimizations.

I have been reading several bioinformatics papers about sequence alignment but do not have a very clear picture of the field yet. So, I just wanted to ask do you think the above-mentioned algorithm would be useful? Has someone else already published comparable algorithm? I would like to stress the exactness here, since software such as BLAST appear to use heuristics.

Finally, I should maybe add that the algorithm is largely based on another (very successful) algorithm in an entirely different domain. I just modified it for this purpose.

This is my first post here so please let me know if my question could be improved. I am not yet familiar with much of the bioinformatics field.

sequence alignment • 1.9k views
ADD COMMENT
2
Entering edit mode

The Smith-Waterman algorithm is exact (it does not use any heuristics), it allows for insertions and deletions and the complexity is O(nm). A very hard to beat one :)

ADD REPLY
0
Entering edit mode

Thanks for the comment. For non-special inputs, I can do the exact alignment in approximately O(n+m) time. So if I understand your comment correctly, I should try to publish my results?

ADD REPLY
1
Entering edit mode

The local sequence alignment problem has been worked on for many (40+) years now and I would be surprised if an efficient exact method used in another field had been missed, but I suppose it's possible but so far most fast algorithms seem to be using heuristics. For an idea of approaches used, have a look at this review from 2010. Since then, there have been a few reports of new algorithms e.g. FOGSAA. As for how useful it would be, you'd have to provide a few compelling examples and release an easy to use implementation in the wild to see if it gets adopted over existing methods.

ADD REPLY
0
Entering edit mode

Thanks! I will read the papers. I guess the compelling example could be something that is difficult to compute with the Smith-Waterman, e.g. performing thousands or millions of alignments in little time, searching entire databases or something like that. Any ideas would be welcome. Actually, I have not found any great test dataset yet. A parallel version of the algorithm could align against 1000 human genomes in less than 10 seconds. Would that be any good? (I really have no idea about the biology side of this thing)

ADD REPLY
1
Entering edit mode

The reality is that gapless alignment is of very limited utility. "The biology side of this thing" tends to introduce insertions/deletions/rearrangements/etc.

As a general rule, the best approach is to understand the problem (and existing solutions) before building a better mousetrap...

ADD REPLY
0
Entering edit mode

Can it be scaled up to solve multiple-sequence alignment? I mean really doing the alignment (what would require an N^d with NW), not with shortcuts. This is global alignment though. How are the memory requirements?

ADD REPLY

Login before adding your answer.

Traffic: 1548 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6