I am blasting a local fasta file of protein sequences against a local protein database. The Blast works fine and I am able to generate an XML file as my output. I would like to extract the query and top hit sequences from this XML file. However, my HSPs (high-scoring pair objects) in the XML file have missing information.
Here is the code that generates the XML file:
blast_handle = NcbiblastpCommandline(query='/PATH/to/proteins_of_interest.fasta', db='/PATH/to/local/db/protein.fa', outfmt=5, out='/PATH/to/results.xml')
stdout, stderr = blast_handle()
result_handle = open('/PATH/to/results.xml')
blast_records = NCBIXML.parse(result_handle)
I successfully get an XML file, and the following loop gives me the name of the top hit of each Blast (there were four protein sequences in my query fasta file):
for blast_record in blast_records:
print(blast_record.alignments[0].hit_id)
However, my XML file is missing HSP data. This is a portion of the XML file:
<Hit>
<Hit_num>9</Hit_num>
<Hit_id>lcl|name_of_my_hit</Hit_id>
<Hit_def>Unknown</Hit_def>
<Hit_accession>Unknown</Hit_accession>
<Hit_len>0</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>105.145</Hsp_bit-score>
<Hsp_score>261</Hsp_score>
<Hsp_evalue>9.96617e-26</Hsp_evalue>
<Hsp_query-from>0</Hsp_query-from>
<Hsp_query-to>0</Hsp_query-to>
<Hsp_hit-from>0</Hsp_hit-from>
<Hsp_hit-to>0</Hsp_hit-to>
<Hsp_identity>60</Hsp_identity>
<Hsp_qseq></Hsp_qseq>
<Hsp_hseq></Hsp_hseq>
</Hsp>
</Hit_hsps>
</Hit>
As you can see, there is nothing between the Hsp_qseq and Hsp_hseq start and end tags. Does anyone know why the amino acid sequences aren't here? Thanks!
Update: I ran the same code, except outputted to a text file instead of an XML file. What I get for each alignment looks like this:
So it appears that blast is looking for the alignment in the NCBI database in order to make the alignments, which I don't understand because I gave it a local database to search. And it clearly finds the best matches in the local database, hence the odd name of the match...so why isn't it just aligning the query to the hit it gets in my database?
Probably the error is not because BLAST is looking for the matched sequences in the NCBI database. BLAST is used by many to search local databases, there should not be such bug that breaks the software this way.
Sequence with id ... no longer exists error is reported in showalign.cpp in BLAST source code. Instead of reporting the actual error/exception they report the error we see. Most likely the error we see is the most common reason of failure in the two try-catch blocks that report this error. If you get the error consistently it should be possible to modify the source code to report the actual exception but it should be better to contact NCBI help desk first.
Thank you! I will contact the help desk.
Great - please update this question when you hear back from the NCBI - thanks.
This is definitely a BLAST+ problem (and an unusual one), not something in Biopython.