Hello,
I have some questions about blast-2.10.1+. I have made a db as follows:
"path/to/NCBI/blast-2.10.1+/bin/makeblastdb.exe" -dbtype "prot" -in "path/to/fasta_db.txt" -parse_seqids -hash_index
I have then used this in a blast+ task run using python's subprocess module - the equivalent command line entry is:
blastp -out results\out\path -query fasta\query.txt -db db_name -evalue eval -outfmt 5 -num_threads thread_num -parse_deflines -seg 'yes' -soft_masking 'true' -use_sw_tback -num_alignments '200'
When I look at the xml results, I get inconsistent hit id's (e.g. dbj|expected_accession|
, gb|expected_accession|
, ref|expected_accession|
). Can you tell me why this is happening, how to get the accession as the hit id with no prefixes, and, if that is not possible, whether I can assume the hit_id will always be contained within two '|' characters?
As you can see for the HSP below (sorry the formatting wont work properly), I also seem to be getting HSPs of length 0, which I assume is an issue relating to my blast command - however the other metrics (score etc) are there. Would you be able to tell me why this is based on the commands I have provided? I think it is an issue specific to this database (I have performed the same blast call with a different db).
<Hit>
<Hit_num>6</Hit_num>
<Hit_id>ref|YP_003273669.1|</Hit_id>
<Hit_def>Unknown</Hit_def>
<Hit_accession>Unknown</Hit_accession>
<Hit_len>0</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>65.855</Hsp_bit-score>
<Hsp_score>159</Hsp_score>
<Hsp_evalue>5.2884e-10</Hsp_evalue>
<Hsp_query-from>0</Hsp_query-from>
<Hsp_query-to>0</Hsp_query-to>
<Hsp_hit-from>0</Hsp_hit-from>
<Hsp_hit-to>0</Hsp_hit-to>
<Hsp_identity>63</Hsp_identity>
<Hsp_qseq></Hsp_qseq>
<Hsp_hseq></Hsp_hseq>
</Hsp>
</Hit_hsps>
</Hit>
Thanks, Tim
EDIT I noticed that there was an extra file when I made the DB (.pdb-lock), compared to when I had done this previously with success. I tried remaking the database as described above after deleteing all of the old files, and the same thing happened. I then tried making the database without -parse_seqids, the lock file wasn't generated, and I got normal HSPs with a len >0. All of this is leading me to suspect its an issue with how my database is being made, but I've made other dbs with -parse_seqids with no issues before...
This may have something to do with it:
Hi Genomax, I made a webscraper to get the sequences, although they were removed even when I made the webscraper. The sequences were then made into a local database, so i dont think it should be an issue (it might not be in ncbi's databases but it is in mine). I've also checked all the formating of deflines etc, so it's a bit of a mystery!
From the fasta I used to make the db:
Since we can't see/access your data this may be difficult to diagnose. Someone else may have additional insight. Use
blastdbcheck -db DB
to verify that database looks good.