I have BLAST output in .xml format that I am parsing for certain features and writing them to a file. Here is the code:
from Bio.Blast import NCBIXML
import sys
input_file=str(sys.argv[1])
output_file=str(sys.argv[2])
e_value_thresh=float(sys.argv[3])
result_handle=open(input_file)
blast_records=NCBIXML.parse(result_handle)
with open(output_file, "w") as f:
for blast_record in blast_records:
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect< e_value_thresh:
f.write('****Alignment****' + "\n")
f.write('Subject Sequence:'+ str(alignment.title) +"\n")
f.write("Query Sequence: " + str(blast_record.query) + "\n") #also added
f.write('length:'+ str(alignment.length) +"\n")
f.write('e value:'+ str(hsp.expect) +"\n")
f.write(hsp.query[0:75] + '...'+"\n")
f.write(hsp.match[0:75] + '...'+"\n")
f.write(hsp.sbjct[0:75] + '...'+"\n")
What I need to do is also screen out HSPs where the subject and query are the same sequence, since my BLAST query and database files are identical. I thought of changing:
if hsp.expect< e_value_thresh:
to
if hsp.expect< e_value_thresh and str(alignment.title) is not str(blast_record.query):
but this generated an error instead.
I would also like to write where the HSP match begins for the query and subject but
f.write('Subject Start:' + str(hsp.subjct_start) + "\n")
just gave an error saying that subjct_start
was not an attribute of hsp
I'm still very new to Biopython and Python in general so I'm sure there is something very simple that I'm missing here, and I apologize if that's the case
Please read through the documentation here: http://biopython.org/DIST/docs/api/Bio.Blast.Record-module.html This will give you all modules, classes and methods available, and their context.
Right, that fixed my Start and End issue(dumb spelling mistake on my part, thanks!). I was viewing this document before and it did solve some other problems I was having but it wasn't able to answer how to screen out HSPs with the exact same query and subject.
Maybe compare gi from the names?