Here's one approach. It's a little hacky, and going back to the original sequence to find the indicies might be unecessary, but I'm not 100% sure how pairwise2
behaves with local alignments (if it only returns the local match, and not the whole sequence then it is necessary), otherwise, the numbers you want are in the min and max values of the coords
array.
from Bio import SeqIO, pairwise2
from Bio.pairwise2 import format_alignment #optional
import sys
querystring = sys.argv[2]
for record in SeqIO.parse(sys.argv[1], 'fasta'):
aln = pairwise2.align.localms(record.seq, querystring,
5, -4, # match, mismatch
-2, -0.5) # open, extend
[print(format_alignment(*a)) for a in aln] #optional
coords = [i for i in range(len(aln[0][0])) if aln[0][0][i] == aln[0][1][i]]
subseq = aln[0][0][coords[0]:coords[-1]+1]
seqstart = str(record.seq).index(subseq)
print('Subseq coordinates: {} -> {}'.format(seqstart, seqstart+len(subseq)))
Disclaimer:
This is hacky as I said, and there may be an off-by-one error or two (ironic huh?) in the output numbering, so please check this.
I've chosen some alignment values that seemed to give me reasonable alignments in my testing, but you may need to tweak these to suit. I suspect gap opening may need to be penalised more strongly to promote a 'tight' local match so that the minimum amount of extraneous sequence is selected.
You might want to try the approach input in this thread:
A: Get genome position and sequence with extra bases of a DNA sequence (python)
The pairwise2 approach would probably work too. Default parameters should be fine.
I have something like this atm but the sequences from the plasmids are very error prone and so it is unlikely that the main sequence will contain the sub sequence exactly. I was trying to use pairwise to get the alignment score but it I'm not sure how to get it to find a point in the first sequence that contains the highest alignment score for the subsequence.
The sequence you've provided is not a whole plasmid sequence, right?
It looks like some gene upstream. In this or similar cases you have to find
orthologous gene upstreams (the same gene from close species), find out, how
your site looks like there, then make a PWM (position weight matrix) - it will give you
a site score, since the site may be different in different plasmids and different upstreams.
This approach will take care about single substitutions in your site. There are many posts
about PWM in 'Biostars', read them to learn more.
In particular, see this post. PWM matching alogrithm
MOODS has modern references I've not seen anywhere else.
OP doesn’t need to do any of this.
Their approach with alignment is the correct way to go, they just need to extract the hit coordinates. I assume this is possible from pairwise2 but I haven’t had chance to test it myself yet.
OP, if you’re happy with a non python solution my comment above will do exactly what you need.
Alternatively, you can do this with the regex engine, if you know which bits of the sequence are conserved and which are variable in your query?
I 've seen this post that should help to find identical residues.
A: index of the residue matched using pairwise2 in biopython
I hope it may help you.
Again, I don't think this is what OP is looking for. They want to know the indices of their sequence in the original plasmid; e.g. if their sequence of interest occurs at base 1012 to base 1020 in the reference plasmid sequence, they want those numbers.
The link you've provided is just showing how to enumerate the match itself (which is not connected in any way that I can see to the original numbering of the input sequence.
As far as I have understood, a person who asked a question
was going to get a position of some fragment in a larger sequence.
But that particular fragment may differ a little bit from a real plasmid
sequence. The matrix I suggested a few days ago is an ancient approach,
but it would help. The alignment will show if and where the difference exists.
They will decide what is better for them - there is no reaction so far.
On closer inspection, I don't think there's any way to get the information you want directly from pairwise2. Do you still specifically want a Biopython solution?