Missing HSPs from Blast xml output
0
0
Entering edit mode
8.1 years ago

I am blasting a local fasta file of protein sequences against a local protein database. The Blast works fine and I am able to generate an XML file as my output. I would like to extract the query and top hit sequences from this XML file. However, my HSPs (high-scoring pair objects) in the XML file have missing information.

Here is the code that generates the XML file:

blast_handle = NcbiblastpCommandline(query='/PATH/to/proteins_of_interest.fasta', db='/PATH/to/local/db/protein.fa', outfmt=5, out='/PATH/to/results.xml')

stdout, stderr = blast_handle()

result_handle = open('/PATH/to/results.xml')
blast_records = NCBIXML.parse(result_handle)

I successfully get an XML file, and the following loop gives me the name of the top hit of each Blast (there were four protein sequences in my query fasta file):

for blast_record in blast_records:
    print(blast_record.alignments[0].hit_id)

However, my XML file is missing HSP data. This is a portion of the XML file:

<Hit>
  <Hit_num>9</Hit_num>
  <Hit_id>lcl|name_of_my_hit</Hit_id>
  <Hit_def>Unknown</Hit_def>
  <Hit_accession>Unknown</Hit_accession>
  <Hit_len>0</Hit_len>
  <Hit_hsps>
    <Hsp>
      <Hsp_num>1</Hsp_num>
      <Hsp_bit-score>105.145</Hsp_bit-score>
      <Hsp_score>261</Hsp_score>
      <Hsp_evalue>9.96617e-26</Hsp_evalue>
      <Hsp_query-from>0</Hsp_query-from>
      <Hsp_query-to>0</Hsp_query-to>
      <Hsp_hit-from>0</Hsp_hit-from>
      <Hsp_hit-to>0</Hsp_hit-to>
      <Hsp_identity>60</Hsp_identity>
      <Hsp_qseq></Hsp_qseq>
      <Hsp_hseq></Hsp_hseq>
    </Hsp>
  </Hit_hsps>
</Hit>

As you can see, there is nothing between the Hsp_qseq and Hsp_hseq start and end tags. Does anyone know why the amino acid sequences aren't here? Thanks!

blast blastp BioPython hsp • 3.2k views
ADD COMMENT
0
Entering edit mode

Update: I ran the same code, except outputted to a text file instead of an XML file. What I get for each alignment looks like this:

Sequence with id evm_27.model.AmTr_v1.0_scaffold00105.17 no longer exists in database...alignment skipped

So it appears that blast is looking for the alignment in the NCBI database in order to make the alignments, which I don't understand because I gave it a local database to search. And it clearly finds the best matches in the local database, hence the odd name of the match...so why isn't it just aligning the query to the hit it gets in my database?

ADD REPLY
0
Entering edit mode

Probably the error is not because BLAST is looking for the matched sequences in the NCBI database. BLAST is used by many to search local databases, there should not be such bug that breaks the software this way.

Sequence with id ... no longer exists error is reported in showalign.cpp in BLAST source code. Instead of reporting the actual error/exception they report the error we see. Most likely the error we see is the most common reason of failure in the two try-catch blocks that report this error. If you get the error consistently it should be possible to modify the source code to report the actual exception but it should be better to contact NCBI help desk first.

ADD REPLY
0
Entering edit mode

Thank you! I will contact the help desk.

ADD REPLY
0
Entering edit mode

Great - please update this question when you hear back from the NCBI - thanks.

This is definitely a BLAST+ problem (and an unusual one), not something in Biopython.

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6