Hello,
I have some Sanger sequence data and a database of Illumina sequences. I want to match the Sanger sequence to its corresponding Illumina sequence. This is the code I used:
from Bio import SeqIO
from Bio import pairwise2
from Bio import Seq
fasta_sequences = SeqIO.parse(open("taxa_all.fasta"),'fasta') ### database of Illumina sequences
with open ("isolation-round1/778291/High_Intensity/18_F.ab1.seq") as myfile:
    isolate=myfile.readline() ### the Sanger sequence
score=[]
name=[]
for fasta in fasta_sequences:
    n,sequence = fasta.id, str(fasta.seq)
    for a in pairwise2.align.localxx(isolate,sequence):
        al1,al2,s,begin,end=a
        score.append(s)
        name.append(n)
When I ran this code, which took about 5 min, the scores I obtained were very close to each other, and I was therefore unable to say with certainty what the correct mapping was.
So then I changed it to
for a in pairwise2.align.localms(isolate,sequence,2, -1, -.5, -.1):
keeping everything else the same. This version took about 3 h to run! On examining the results, I noticed that each name and score was repeated 1000 times, but I don't understand why. Am I doing something wrong? Is there a way to speed up the process?
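One possible explanation, offered as a sketch rather than a definitive answer: as far as the pairwise2 documentation describes it, the align functions return a list of all co-optimal alignments for a pair, capped by `pairwise2.MAX_ALIGNMENTS` (1000 by default). Looping over that list would append the same name and score once per alignment, and the traceback needed to build those alignments is probably where much of the time goes. Passing `score_only=True` makes the call return just the best score. The helper names below (`local_score`, `rank_hits`, `score_fn`) are hypothetical and only for illustration; the scoring parameters are the ones from the `localms` call above.

```python
def local_score(query, target):
    """Return the best local alignment score for one pair as a single number."""
    # Imported lazily so the ranking logic below can be tried without Biopython.
    from Bio import pairwise2
    # score_only=True skips the traceback and returns the score,
    # not a list of alignment tuples.
    return pairwise2.align.localms(query, target, 2, -1, -0.5, -0.1,
                                   score_only=True)


def rank_hits(query, records, score_fn=local_score):
    """Score the query against every (name, sequence) pair, best hit first.

    Exactly one (name, score) entry is produced per database record.
    """
    hits = [(name, score_fn(query, seq)) for name, seq in records]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits
```

With the data from the question, this might be called as `rank_hits(isolate.strip(), ((rec.id, str(rec.seq)) for rec in fasta_sequences))`. The `.strip()` is an assumption on my part: `readline()` keeps the trailing newline, which would otherwise be treated as part of the sequence. If the alignments themselves are needed, `one_alignment_only=True` is the documented option for recovering a single alignment per pair.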