Need Help Understanding Spliced Sequence Alignment Algorithm
2
5
Entering edit mode
11.5 years ago
corn8bit ▴ 140

I'm reading Jones & Pevzner's Introduction to Bioinformatics Algorithms and am stuck at the end of chapter 6.14: spliced alignment problem. (research paper here)(slides on the chapter here including pseudocode at slide 28)

I understand that if I have lots of local alignments between a sample and reference DNA that if I just took the largest weighted non-overlapping path then I would get a nonsensical protein (because the protein is cut up in different orders on both strands). Example:

Sample strand (x = intron): 789xxxx123xxxx146xxxx456xxx

Reference protein: 123456789

This would align 789123146456 if we just took the weight and overlap into account!

So the book starts talking about this strange dynamic programming directed weighted graph (DAG) to solve the problem, which looks different from any DAG we've seen in the book so far, and for the life of me I can't wrap my head around the algorithm or pseudocode.

My question: Why can't I just simply align each local alignment to the reference protein, and THEN take the largest weighted non-overlapping path? Like this:

[1    4 6]  ---local alignments
[123][456][789]  ---local alignments
 123  456  789  ---reference protein
= [123] [456] [789]

And if this is wrong, could you please explain the correct DAG algorithm in a way I could understand (I understand the book up to here, and DAGs and dynamic programming)? Thanks!

Edit: It appears that the book's algorithm is scanning every single possible chain at every point in an N^2 array... this can't be right!

Edit2: The algorithm (see slide 28 which I linked) generates a 3d graph! I think I almost understand this! Will post back when I do.

alignment homework algorithm • 4.6k views
ADD COMMENT
4
Entering edit mode
11.5 years ago
lh3 33k
  • Alignment chaining problem. This is to find a subset of colinear local alignments that optimizes an objective function (e.g. sum of local alignment scores). The problem can be reduced to finding the optimal path on a DAG and can be solved by dynamic programming in quadratic time. For some types of objective functions, it has O(nlogn) solutions. There are quite a few papers on optimal chaining. It has been widely used in genome-wide aligners and in cDNA/EST aligners. The longest common subsequence (LCS) problem is a special case of alignment chaining. I think your solution is essentially alignment chaining.

  • Exon chaining problem (in the book). I think using "chaining" here is unusual. To me, chaining implies colinearity, but exon chaining here does not. The use of word "chaining" confused me for some time. Anyway, I have not seen this problem used in practice.

  • Spliced alignment problem (in the book). The following is my understanding, which could be wrong. Unlike alignment chaining, spliced alignment ignores the local alignment of each block and reconstructs the alignment while chaining. It is essentially doing alignment chaining and Smith-Waterman alignment at the same time.

  • The genewise model. GeneWise is different from the above in that it does not require a list of exon blocks. It simply aligns a protein to a genomic region in a splicing-aware way. It is using dynamic programming. GeneWise used to be the best program to predict genes.

The spliced alignment problem in the book is theoretically interesting, but I am not sure if it is practically advantageous. If it still scans every base in the reference, it will have similar or worse time complexity to GeneWise but may be less accurate; if it jumps between exon blocks, alignment chaining may achieve similar accuracy and be faster. It would be better to introduce alignment chaining and the GeneWise model in my opinion.

ADD COMMENT
3
Entering edit mode

Interesting. I'm now looking into the GeneWise algorithm, and I'd be interested to know what algorithm is the current best gene predition one.

" It is essentially doing alignment chaining and Smith-Waterman alignment at the same time. "

My mind is now blown. It does Waterman and chaining at the same time, yet is suboptimal? How can it do both at the same time? I would be very grateful if you could summarize what's happening in the book's algorithm through explaining one iteration of the process. Also I appriciate all the time you've already spent on this!

ADD REPLY
2
Entering edit mode

On the spliced alignment, I realize later that it should be possible to jump between blocks. Then it could be faster than GeneWise. But I still think chaining will be much faster and give similar albeit slightly lower accuracy. As to why spliced alignment is suboptimal, exon blocks are usually determined by heuristic. They can be wrong. Meanwhile, you don't really need these blocks. Doing a dynamic programming in one go like GeneWise is likely to give more accurate result. As to which is the state-of-art protein-to-genome aligner, I think it is Exonerate. I am not so sure, though.

My understanding of the spliced alignment comes from the text description. It is possible and sensible to do SW inside blocks. SW is a dynamic programming; chaining is, too. There should naturally be a dynamic programming for the integrated problem. You only need to augment the state space. However, I haven't read the equations. That is why I said I could be wrong.

ADD REPLY
2
Entering edit mode
11.5 years ago
corn8bit ▴ 140

I figured it out! I recommend reading pages 1 and 2 of the research paper before continuing.

Essentially we are in fact doing what I described in the question. We're aligning the blocks to the protein, and taking the best non overlapping path. The clever part is that the algorithm does it all in one function!

The input to the algorithm is the following 3 variables:

Mystery_dna = The big test DNA strand with lots of exons and introns which we'll get our local alignment blocks from.

Known_protein = The DNA sequence which we're comparing our blocks to.

Blocks = Every single block from the Mystery_dna which matches a section in Known_protein. These are the local alignments.

The output is going to be the score for the best aligned path of blocks. (if we wanted the path we could easily enough store path information as we went, but for simplicity we should not do that now).

Let's set up our variables now.

Mystery_dna[i] = the value of the i point in the DNA strand
  (example: ATCGATCG would be G at i = 4)
Known_protein[j] = the value of the j point in the protein strand
Blocks[k] = one block of nucleotides 
Up_to_block[k] = all the blocks up to and excluding the ones that contain Mystery_dna[k], this is to weed out overlaps.
  (example: Mystery_dna = ATCG
  Blocks = [AT][GC]
  Up_to_block[3] = The blocks not including Mystery_dna[3] = [AT])
Last (Block[k]) = The last position in Block k, which will match that position on the Mystery_dna.
First (k) = The first position in Block k.
#this is a comment, which I'll start with a #

Now that we have our variables, we should go over what happens in the recursion. What will happen is we'll generate a 3d table which is just the same 2d table dynamic programming thing that is explained all through chapter 6 of the book with an added dimension of Blocks. At each iteration we'll go through each current point at each block. If we're at the start of any block then we need to use the Up_to_block function and to weed out overlaps and use the value of the greatest Block's last point (up to Block[current]). We'll use +1 for a match and -1 for an indel or mismatch for simplicity.

The recursion: (remember, i, j ,k are indices for our inputs)

Score (i, j, k):
  Return Max of 
    # These 2 lines are normal dynamic programming, just finding the best path along the protein.
    if i != First (k): Score (i-1, j-1, k) + 1 
    if i != First (k): Score (i-1, j, k) - 1
    # THIS is the hard part! If we're at the start of our block-
    # (and remember we'll go through each point for each block)
    # -then continue the value from the greatest, non-overlapping block up to this point. It won't 
    # overlap because of "Up_to_block[k]"
    if i == First (k): Return Max of the following for all cur_block in Up_to_block[k] 
      Score (Last (cur_block), j-1, cur_block) + 1 
    # This line just happens if we have a mismatch at our first position in the block.
    if i == First (k): Return Max of the following for all cur_block in Up_to_block[k] 
      Score (Last (cur_block), j, cur_block) - 1 
    Score (i, j-1, k) -1

Now, if you made it this far you have a 3d array of values, and the one in one of the Blocks at the end of the array will be our score.

Return the Max of this for each block in Blocks:
  Array [Last[block], end_pos_of_protein, block]

And that will be the max score!

Note: lh3's answer where he mentioned doing Waterman and chaining at the same time was key to me figuring this out, so I'm marking that as the answer.

ADD COMMENT

Login before adding your answer.

Traffic: 2092 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6