Is there any way to filter BLAST results from the XML you get from Biopython's NCBIWWW module on the basis of percent identity? I can't find anything like that in the XML, which looks like this in what I think is the relevant section for a given result:
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>2673.88</Hsp_bit-score>
<Hsp_score>2964</Hsp_score>
<Hsp_evalue>0</Hsp_evalue>
<Hsp_query-from>1</Hsp_query-from>
<Hsp_query-to>1482</Hsp_query-to>
<Hsp_hit-from>4596</Hsp_hit-from>
<Hsp_hit-to>3115</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>-1</Hsp_hit-frame>
<Hsp_identity>1482</Hsp_identity>
<Hsp_positive>1482</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>1482</Hsp_align-len>
Here's the code I used to generate that XML:
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import SeqIO, Entrez

file_to_read = 'liberibacter_16s_sequences.fasta'
blast_list = []

# Submit each FASTA record to NCBI BLAST and keep the result handle
for record in SeqIO.parse(file_to_read, 'fasta'):
    result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
    blast_list.append(result_handle)

# Write every result to one file; the with block closes it on exit
with open('results.xml', 'w') as save_file:
    for handle in blast_list:
        blast_results = handle.read()
        save_file.write(blast_results)
Is there a way to parse this XML to pull out what I'm looking for, and if not, is there some way to adjust the parameters of my code to pull down that information from BLAST?
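For what it's worth, percent identity can be derived from two fields that are already in the XML excerpt: Hsp_identity (identical positions) divided by Hsp_align-len (alignment length). Here is a minimal sketch using only the standard library. The SAMPLE fragment reuses the tag names and values shown above, with closing tags added so it parses on its own; the helper names percent_identity and filter_hsps and the 97% cutoff are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed fragment built from the tags and values in the excerpt above.
# Closing tags are added here so the fragment is well-formed by itself.
SAMPLE = """
<Hit_hsps>
  <Hsp>
    <Hsp_num>1</Hsp_num>
    <Hsp_identity>1482</Hsp_identity>
    <Hsp_gaps>0</Hsp_gaps>
    <Hsp_align-len>1482</Hsp_align-len>
  </Hsp>
</Hit_hsps>
""".strip()


def percent_identity(hsp):
    """Identical positions as a percentage of the alignment length."""
    identical = int(hsp.findtext("Hsp_identity"))
    align_len = int(hsp.findtext("Hsp_align-len"))
    return 100.0 * identical / align_len


def filter_hsps(root, min_identity):
    """Return the <Hsp> elements at or above min_identity percent."""
    return [hsp for hsp in root.iter("Hsp")
            if percent_identity(hsp) >= min_identity]


root = ET.fromstring(SAMPLE)
kept = filter_hsps(root, min_identity=97.0)  # hypothetical cutoff
for hsp in kept:
    print(hsp.findtext("Hsp_num"), percent_identity(hsp))
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE), provided the file holds a single BLAST report. Writing several reports into one file, as the script above does, does not give a single well-formed XML document, so one file per query is probably safer. If you parse with NCBIXML instead, I believe the HSP objects carry the same two numbers as identities and align_length, but check that against your Biopython version.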
Do you need the results in XML format? You could, for example, have BLAST return its tabular format instead and filter by column.
The Biopython docs say parsing anything other than XML breaks easily, but there's no particular reason my project needs the results to be in XML. I'll give that a shot.
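To illustrate the filter-by-column idea, here is a rough sketch that assumes the results are in the 12-column tabular layout BLAST+ writes with -outfmt 6, where the third column (pident) is percent identity. Whether qblast can return exactly this layout is not confirmed here. The rows in SAMPLE are made up for illustration and are not real BLAST output, and filter_by_identity is a hypothetical helper name.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up rows for illustration only; not real BLAST output
SAMPLE = (
    "query1\tsubjA\t100.000\t1482\t0\t0\t1\t1482\t4596\t3115\t0.0\t2674\n"
    "query1\tsubjB\t91.250\t1480\t120\t5\t1\t1478\t200\t1675\t0.0\t1990\n"
)


def filter_by_identity(handle, min_identity):
    """Return rows whose pident column is at or above min_identity."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if float(row["pident"]) >= min_identity]


hits = filter_by_identity(io.StringIO(SAMPLE), min_identity=97.0)
for row in hits:
    print(row["sseqid"], row["pident"])
```

With a real file, pass an open file handle in place of io.StringIO(SAMPLE).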