So here is my aim. I've carried out a BLAST search and downloaded the results in two formats: Complete FASTA sequences and aligned FASTA sequences. In the aligned FASTA file, I want to retrieve the ID and sequence for each hit (there are 32 hits). Then I want to take each hit and compare them to the complete sequences (there are 20 genomic regions the hits were found in) IF the hit ID and complete sequence ID match i.e. the aligned sequence belongs to that complete sequence, then I find the start of the hit sequence within the complete sequence and use that to retrieve the sequence of a flanking sequence upstream of the hit.
I can do this manually for one hit which I know matches a particular complete sequence like so:
from Bio import SeqIO
completelist = list(SeqIO.parse("complete.fasta", "fasta"))
hitlist = list(SeqIO.parse("hits.fasta", "fasta"))
start = completelist[0].seq.find(hitlist[0].seq)
print(completelist[0].seq[start-100:start]) #This returns the 100bp upstream of the hit
I skipped over a bunch of basics to learn Biopython, so this might look stupid but to do this automatically for every hit I've tried:
def flankfind(complete_handle, hit_handle):
from Bio import SeqIO
completes = SeqIO.parse(complete_handle, "fasta")
hits = SeqIO.parse(hit_handle, "fasta")
for hit_record in hits:
for complete_record in completes:
if hit_record.id == complete_record.id:
start = complete_record.seq.find(hit_record.seq)
print(complete_record.seq[start-100:start])
flankfind("complete.fasta", "hits.fasta")
EDIT1:
I noticed and fixed one problem in my script. I didn't have ".seq" in the "complete_record.seq.find(hit_record.seq) part. Now the script actually returns the correct upstream 100bp sequence for the very first hit, but stops there.
I have implemented the completes.seek(0) solution, but this returns the error:
Traceback (most recent call last):
File "<pyshell#147>", line 1, in <module>
flankfind("complete.fasta", "hits.fasta")
File "<pyshell#146>", line 11, in flankfind
completes.seek(0)
AttributeError: 'generator' object has no attribute 'seek'
EDIT2:
Okay, the issue is fixed. I just had to convert "hits" and "completes" into lists based on the solution offered here https://stackoverflow.com/questions/1271320/reseting-generator-object-in-python
My current code is this: def flankfind(complete_handle, hit_handle): from Bio import SeqIO completes = list(SeqIO.parse(complete_handle, "fasta")) hits = list(SeqIO.parse(hit_handle, "fasta"))
for index, hit_record in enumerate(hits):
for complete_record in completes:
if hit_record.id == complete_record.id:
start = complete_record.seq.find(hit_record.seq)
print '>', hit_record.id, index
print complete_record.seq[start-100:start]
flankfind("complete.fasta", "hits.fasta")
This returns the 100bp sequences upstream of the 32 hits with their indices and IDs though the formatting for the print section can be improved.
I'll be back in 30min or so but I tried this and it returned a single ID for the first FASTA sequence.
flankfind("complete.fasta", "hits.fasta")
'AC226770.1'
Thanks for the help
After fixing an issue in my original script I used completes.seek(0) again and got this:
I don't really understand the issue. You can use completes.next() to read through each SeqRecord item just fine. I only read up on seek() a little but doesn't that mean it should be able to set it back to the start of the file?
That is what it is suppose to do.
But SeqIO.parse returns a generator.