I figured it out! I recommend reading pages 1 and 2 of the research paper before continuing.
Essentially we are in fact doing what I described in the question. We're aligning the blocks to the protein, and taking the best non overlapping path. The clever part is that the algorithm does it all in one function!
The input to the algorithm is the following 3 variables:
Mystery_dna
= The big test DNA strand with lots of exons and introns which we'll get our local alignment blocks from.
Known_protein
= The DNA sequence which we're comparing our blocks to.
Blocks
= Every single block from the Mystery_dna which matches a section in Known_protein. These are the local alignments.
The output is going to be the score for the best aligned path of blocks. (if we wanted the path we could easily enough store path information as we went, but for simplicity we should not do that now).
Let's set up our variables now.
Mystery_dna[i] = the value of the i point in the DNA strand
(example: ATCGATCG would be G at i = 4)
Known_protein[j] = the value of the j point in the protein strand
Blocks[k] = one block of nucleotides
Up_to_block[k] = all the blocks up to and excluding the ones that contain Mystery_dna[k], this is to weed out overlaps.
(example: Mystery_dna = ATCG
Blocks = [AT][GC]
Up_to_block[3] = The blocks not including Mystery_dna[3] = [AT])
Last (Block[k]) = The last position in Block k, which will match that position on the Mystery_dna.
First (k) = The first position in Block k.
#this is a comment, which I'll start with a #
Now that we have our variables, we should go over what happens in the recursion. What will happen is we'll generate a 3d table which is just the same 2d table dynamic programming thing that is explained all through chapter 6 of the book with an added dimension of Blocks. At each iteration we'll go through each current point at each block. If we're at the start of any block then we need to use the Up_to_block function and to weed out overlaps and use the value of the greatest Block's last point (up to Block[current]). We'll use +1 for a match and -1 for an indel or mismatch for simplicity.
The recursion: (remember, i, j ,k
are indices for our inputs)
Score (i, j, k):
Return Max of
# These 2 lines are normal dynamic programming, just finding the best path along the protein.
if i != First (k): Score (i-1, j-1, k) + 1
if i != First (k): Score (i-1, j, k) - 1
# THIS is the hard part! If we're at the start of our block-
# (and remember we'll go through each point for each block)
# -then continue the value from the greatest, non-overlapping block up to this point. It won't
# overlap because of "Up_to_block[k]"
if i == First (k): Return Max of the following for all cur_block in Up_to_block[k]
Score (Last (cur_block), j-1, cur_block) + 1
# This line just happens if we have a mismatch at our first position in the block.
if i == First (k): Return Max of the following for all cur_block in Up_to_block[k]
Score (Last (cur_block), j, cur_block) - 1
Score (i, j-1, k) -1
Now, if you made it this far you have a 3d array of values, and the one in one of the Blocks at the end of the array will be our score.
Return the Max of this for each block in Blocks:
Array [Last[block], end_pos_of_protein, block]
And that will be the max score!
Note: lh3's answer where he mentioned doing Waterman and chaining at the same time was key to me figuring this out, so I'm marking that as the answer.
Interesting. I'm now looking into the GeneWise algorithm, and I'd be interested to know what algorithm is the current best gene predition one.
" It is essentially doing alignment chaining and Smith-Waterman alignment at the same time. "
My mind is now blown. It does Waterman and chaining at the same time, yet is suboptimal? How can it do both at the same time? I would be very grateful if you could summarize what's happening in the book's algorithm through explaining one iteration of the process. Also I appriciate all the time you've already spent on this!
On the spliced alignment, I realize later that it should be possible to jump between blocks. Then it could be faster than GeneWise. But I still think chaining will be much faster and give similar albeit slightly lower accuracy. As to why spliced alignment is suboptimal, exon blocks are usually determined by heuristic. They can be wrong. Meanwhile, you don't really need these blocks. Doing a dynamic programming in one go like GeneWise is likely to give more accurate result. As to which is the state-of-art protein-to-genome aligner, I think it is Exonerate. I am not so sure, though.
My understanding of the spliced alignment comes from the text description. It is possible and sensible to do SW inside blocks. SW is a dynamic programming; chaining is, too. There should naturally be a dynamic programming for the integrated problem. You only need to augment the state space. However, I haven't read the equations. That is why I said I could be wrong.