This is my aim. I am retrieving a list of BLAST results in FASTA format for a single sequence. What I then want to do is sort any hits that are within a certain distance of each other into another list. Currently all I can do is work out the "start" location for any given hit within its genomic context:
def startfind(complete_handle, hit_handle):
from Bio import SeqIO
completes = list(SeqIO.parse(complete_handle, "fasta"))
hits = list(SeqIO.parse(hit_handle, "fasta"))
for index, hit_record in enumerate(hits): #This is adapted from another function where I actually use the index for something I just forgot to remove it here
for complete_record in completes:
if hit_record.id == complete_record.id:
start = complete_record.seq.find(hit_record.seq)
print hit_record.id, start
What I need to be able to do is this. For every single possible pair of hits where hit_record.id is equal, I want to subtract the "start" for one from the other and then check whether the absolute value/distance is within a certain range Something that would be along the lines of:
if (hit_record.id1 == hit_record.id2 and 7000 < abs(start1-start2) && abs(start1-start2) < 8000):
print [some specific file] >> hit_record1
print [some specific file] >> hit_record2
But for every possible pair of hits. How can I do this?
EDIT: I just figured it out, but if anyone can think of a more efficient way (other formats of data are welcome too) I'd prefer that.
def fullfind(complete_handle, hit_handle):
from Bio import SeqIO
completes = list(SeqIO.parse(complete_handle, "fasta"))
hits = list(SeqIO.parse(hit_handle, "fasta"))
flanks = open("flanks.fasta", "w")
for hit_record1 in hits:
for hit_record2 in hits:
for complete_record in completes:
if hit_record1.id == complete_record.id and hit_record2.id == complete_record.id):
start1 = complete_record.seq.find(hit_record1.seq)
start2 = complete_record.seq.find(hit_record2.seq)
if (8000 < start1-start2 and start1-start2 < 9000):
#Don't need absolute values because every possible pair is considered and there is always one pair of hits where the difference is positive
print complete_record.id
fullfind("complete.fasta", "hit.fasta")
Personally, I think you're actually over-complicating things using BioPython. Use a tab-delimited blast output, which would be easier to parse. The first column is your query id, and the 7-8 cols are your start and end position of the blast result.
Isn't tab-delimited blast output a format you get by converting it from XML format using Biopython? It's not a default format that NCBI BLAST lets you download. And it doesn't give you the actual sequences, so I'd still need a second file containing the complete sequences to eventually retrieve flanks. In the end I need to access two files.
If you're running BLAST on the command-line you can use -outfmt 6 to give you a tab-delimited output. You are correct you will need both files. Your script would need to read the blast output, I would use a dictionary with the query id as the key, and a tuple value containing the start/end positions. Then, read in your fasta, and process the sequences by sequence header (query id) and blast results.
Thanks, I'll consider rewriting the script to automatically carry out a blast search, then convert the xml output into a tab-delimited and FASTA formats so I don't need to access two files.