I have a FASTA file with 254 sequences. I created a BLAST database with masking and then ran blastp against that database, using the same FASTA file as the query (again with masking). The commands are shown below.
But in the results only 203 sequences appear as queries (while the subjects are the correct 254). The output of my tabular blastp looks like this, whereas I would expect the last line to start with "254 ... ... ... ...":
qry id, sbj id, % identity, length, mismatches, gap_opens, q_start, q_end, s_start, s_end, evalue, bit_score
1 30 29.42 673 409 19 4 643 7 646 2e-66 242
1 30 38.26 115 71 0 781 895 645 759 1e-22 106
1 185 27.99 661 350 20 289 889 322 916 2e-59 223
2 253 28.86 648 366 20 267 895 209 780 9e-58 216
.
.
.
203 16 41.30 293 148 3 607 895 529 801 2e-57 216
203 16 29.75 511 305 13 44 542 64 532 5e-40 162
Note that query sequence #2 is matched against sequence #253, but sequence #253 never appears as a query; the last query is #203.
I'm not sure whether I'm expecting the right thing. Shouldn't the last line be sequence #254 queried against some matching subjects? (The sequences are mostly similar, so it is very unlikely that #204-#254 don't align with anything.) Or is this the correct result? If so, can you explain what happens to #204-#254? Thanks!
Here is how I ran BLAST:
./segmasker -in my_fasta.fasta -infmt fasta -outfmt maskinfo_asn1_bin -out my_seg_output.asnb
./makeblastdb -in my_fasta.fasta -input_type fasta -dbtype prot -mask_data my_seg_output.asnb -out my_db -title my_db
./blastp -query my_fasta.fasta -out my_fasta_blasted -evalue 1.0 -dbsize $db_size -max_hsps $hsps -seg "yes" -db_hard_mask 21 -db my_db -outfmt 6
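To pin down exactly which queries are absent from the tabular output, the query column can be compared with the IDs in the input FASTA. The following is only a minimal sketch: the function names are hypothetical, and it assumes that the first word of each FASTA header is the ID that blastp reports in column 1 of `-outfmt 6`. Keep in mind that tabular output contains rows only for hits, so a query with no hit under the E-value cutoff produces no line at all.

```python
def fasta_ids(fasta_lines):
    """Return sequence IDs (first word after '>') in file order."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def queries_with_hits(blast_lines):
    """Return the set of query IDs found in column 1 of tabular output."""
    hits = set()
    for line in blast_lines:
        fields = line.split()
        # Skip blank lines and comment lines (present with -outfmt 7).
        if not fields or fields[0].startswith("#"):
            continue
        hits.add(fields[0])
    return hits


def missing_queries(fasta_lines, blast_lines):
    """Return FASTA IDs that never occur as a query, in FASTA order."""
    hit = queries_with_hits(blast_lines)
    return [seq_id for seq_id in fasta_ids(fasta_lines) if seq_id not in hit]
```

With the files from the commands above, this could be called as `missing_queries(open("my_fasta.fasta"), open("my_fasta_blasted"))`. If the returned list is exactly #204-#254, the run probably stopped or lost those queries; if the missing IDs are scattered, they more likely had no hit that passed the cutoff.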
It turned out that some time ago I asked a similar question.
A: each protein with each protein
The answers may be helpful.
Thanks for your reply! I looked into that question, but it explains how to do an all-vs-all BLAST, which I have already done. I additionally masked with segmasker; could that have caused the problem?
Look at this post:
A: How To Mask Low-Complexity Regions In Proteins?
I suspect you may lose some proteins when you mask your data.
What happens when you omit masking?
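If a second run without masking is available, the two tabular outputs can be compared directly to see which queries only produce hits when masking is off. This is a rough sketch under stated assumptions: the function names are hypothetical, both files are `-outfmt 6` output, and the unmasked run was made against a database built without `-mask_data`, with `-seg no` and without `-db_hard_mask`.

```python
def query_ids(blast_lines):
    """Return query IDs (column 1 of tabular output) in first-seen order."""
    seen = {}
    for line in blast_lines:
        fields = line.split()
        # Skip blank lines and comment lines.
        if fields and not fields[0].startswith("#"):
            seen.setdefault(fields[0], None)
    return list(seen)


def lost_to_masking(masked_lines, unmasked_lines):
    """Return queries with hits in the unmasked run but none in the masked run."""
    masked = set(query_ids(masked_lines))
    return [q for q in query_ids(unmasked_lines) if q not in masked]
```

For example, `lost_to_masking(open("my_fasta_blasted"), open("my_fasta_blasted_nomask"))`, where the second file name is a placeholder for the unmasked output. An empty result would suggest masking is not what removes the queries.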