Hey guys,
for working with the sequencing data from our NGS-machine I need to look for short sequences within the reads, i.e. Primers, Barcodes etc.
I know there is a million tools for this out there, but for my own understanding I'm trying to write some python code, that performs this alignment for me.
I understand that you need the Needleman-Wunsch or Smith-Waterman Algorithm to do this. However I'm struggling to put the two together. Because for the Pattern I need global alignment because I'm looking for the full sequence of the barcode, but for the read I need local alignment because the pattern can be located at any point in the read.
I tried around with initializing the first row of the matrix to zeros and the first column to -1, -2, -3 etc. and this somewhat gives me the desired results. But when the pattern is very close to the end of the read sometimes it doesn't work out as I want.
As an example: read = "TTAACTGCCTGGTCTGGC" pattern = "TAACTGC"
My matrix looks like this:
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[-1. 0. 0. -1. -1. -1. 0. -1. -1. -1. 0. -1. -1. 0. -1. 0. -1. -1. -1.]
[-2. -1. -1. 0. -1. -2. -1. -1. -2. -2. -1. -1. -2. -1. -1. -1. -1. -2. -2.]
[-3. -2. -2. -1. 0. -1. -2. -2. -2. -3. -2. -2. -2. -2. -2. -2. -2. -2. -3.]
[-4. -3. -3. -2. -1. 0. -1. -2. -2. -2. -3. -3. -3. -3. -2. -3. -3. -3. -2.]
[-5. -4. -3. -3. -2. -1. 0. -1. -2. -3. -2. -3. -4. -3. -3. -2. -3. -4. -3.]
[-6. -5. -4. -4. -3. -2. -1. 0. -1. -2. -3. -2. -3. -4. -4. -3. -2. -3. -4.]
[-7. -6. -5. -5. -4. -3. -2. -1. 0. -1. -2. -3. -3. -4. -4. -4. -3. -3. -3.]]
Then I would backtrack from 8 to 2 and thats my alignment.
My question is how do professional programs tackle this problem, I couldn't find information anywhere on how to combine local and global alignment for finding barcodes in NGS-reads. Any help would be much appreciated.
Semi-global sequence alignment: What if there are two equally greatest elements in the last row? - I guess you need this.
Also look at this post: Could someone suggest a fast and efficient implementation of a glocal (semi-global) sequence aligner?
Thank you guys for the links. The images in the first link where actually very helpful for my understanding.
I think I have a better understanding now, of what I need to do. I have another question. For the semi-global alignment to work, do i need a positive bonus for matching characters like you do in the smith-waterman-algorithm or can I just use negative penalties for mismatch, inserts and deletions like the needleman-wunsch-algorithm and give a 0 for matching characters.
Thanks. :)
It was long time ago since I used these algorithms last time, but I am pretty sure that I've used the same scoring scheme for all of them. Usually score for match is bigger than score for others since you want to find the path with the biggest score, but I believe the difference between them matters, not the actual values themself (I can make parallels with likelihoods ratios, but it will overcomplicate things). So use any score you want, but keep score for match higher. Here is explained better: https://www.cs.cmu.edu/~02710/Lectures/ScoringMatrices2015.pdf
Thank you guys for your replys. Thank you for the recommendation of the emboss tool. As i said, right now I'm trying to figure out how exactly these algorithms work, thats why I'm trying to write the code myself but when I start to work on my own pipeline I'll definitely consider using it.
So I decided to use positive score for matches and negative scores for mismatch and indels. I think for the fitted alignment, where you look for a smaller sequence in a longer sequence it doesn't really make a difference, but when you allow overlapping alignment (i.e. reverse primer at the end of the read, that might be truncated because it spans the end of the read) then positive score for matching characters would lead to a higher score for longer matches as opposed to really short matches at the end of the read.
Please use
ADD COMMENT
orADD REPLY
fields for comments and replies. Answer field is reserved for solutions that can help resolve the original question.Whatever works is good =) if you avoid 0-trap - you are fine