I've performed a nucleotide BLAST of a multi-fasta file against itself in order to find sequences which share highly similar flanking sequence, indicative of a duplication event. Given the .xml file produced from this what I need to do is determine cases where significant BLAST hits come not from the same sequence (query and subject are the exact same) but from different sequences. Essentially what I'm trying to do is parse the .xml output and get it to print details about the HSPs as well as IDs for the subject and query in each of those HSPs. I tried playing around with f.write(header.query_id)
and f.write(parameters.query_id)
but those just spit back an error. I suspect those need to be included in a separate for statement but right now I'm very lost.
I'm going off code based on section 7.4 of the Biopython 1.65 manual:
from Bio.Blast import NCBIXML
import sys
#>python screener.py inputfile.xml outputfile.txt
input_file=str(sys.argv[1])
output_file=str(sys.argv[2])
result_handle=open(input_file)
blast_records=NCBIXML.parse(result_handle)
#screens an XML file based on the specification below and prints those results to a file
with open(output_file, "w") as f:
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect< 0.01:
f.write('****Alignment****' + "\n")
f.write('sequence:'+ str(alignment.title) +"\n")
f.write('length:'+ str(alignment.length) +"\n")
f.write('e value:'+ str(hsp.expect) +"\n")
f.write(hsp.query[0:75] + '...'+"\n")
f.write(hsp.match[0:75] + '...'+"\n")
f.write(hsp.sbjct[0:75] + '...'+"\n")
Any suggestions would be, as always, greatly appreciated.
Have you tried blast_record.query_id?
Just gave that a shot. The output it gives is in the form:
Query_2
orQuery_593
, etc.So not very helpful