I'm looking for a way to BLAST each sequence in a file, protein sequences in FASTA format, against all the other sequences in the same file. I'm only interested in the best HSP per sequence-sequence pair. So, for 10 sequences I'd get 10*10=100 outputs. Any easy and efficient way to do this? I'm using BLAST (blastp) 2.2.29+.
Update 1
Say I have 3 sequences in sequences.fasta:
>sp|P0C9J3|1101B_ASFWA Protein MGF 110-11L OS=African swine fever virus (isolate Warthog/Namibia/Wart80/1980) GN=War-022 PE=3 SV=1
MKVLLGLLLGYSVLILAHELPDLPRTQHPPKSELSYWCTYVPQCDFCWDCQDGICKNKIT
ESRFIDSNHSIVNCRVFRDSKTQSCLYEISSKMPNHFNMECLHPRPYTGNEIFMRTWGGG
DHQQLSIKQFCLYFIIGIAYTGCFVCALCKNLRLRTTMKLFILLSILVWLAQPVLNRPLS
IFYTKQILPRTYTPPMRELEYWCTYGKHCDFCWDCKNGICKNKVLDDMPLIVQNDYISKC
SITRFIDRCMYFIEPKIPYIHYMNCSLPTYFS
>sp|O55739|127L_IIV6 Uncharacterized protein 127L OS=Invertebrate iridescent virus 6 GN=IIV6-127L PE=4 SV=1
MTDYKHKINAEGEYLPFYGYTSISMLSSNTSSVQVLQNIEQLISQTIIGKYYSPLPHTTY
HMTLFNIYCMASSPIPPVQRWTLENEENKIPEKLWLSEDVLKIKHIKALDCLRKLPSHLK
ITNLEFYYKKGLGVWVTLDEESFNAVTKLRKEFSVIYEHDDANLKLHITFAYLFKELPFK
NGTREEKKALKTELLKLVKMLNTLKLWTLTNHNIYLYNSMTNYFPIDFFFSSSLV
>sp|P32234|128UP_DROME GTP-binding protein 128up OS=Drosophila melanogaster GN=128up PE=2 SV=2
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVA
KTGDARVGFVGFPSVGKSTLLSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPG
IIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKLLEHELEGFGIRLNKKPPNIY
YKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKG
QLPDYNSPVVLHNERTSIEDFCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDED
VVQIVKKV
I created a database with
user% makeblastdb -in sequences.fasta -title sequences -dbtype prot -out sequences -parse_seqids
Running each sequence as a query against the other, separately
user% blastp -query P0C9J3.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564
sp|P0C9J3|1101B_ASFWA sp|P32234|128UP_DROME 50.00 12 6 0 188 199 291 302 3.7 15.8
sp|P0C9J3|1101B_ASFWA sp|O55739|127L_IIV6 33.33 12 8 0 110 121 88 99 54 11.9
user% blastp -query P32234.fasta -db sequences -outfmt 6 -evalue 100
sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756
sp|P32234|128UP_DROME sp|P0C9J3|1101B_ASFWA 50.00 12 6 0 291 302 188 199 4.5 15.8
sp|P32234|128UP_DROME sp|O55739|127L_IIV6 35.29 17 11 0 255 271 106 122 7.7 15.0
user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100
sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481
sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0
sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 33.33 12 8 0 88 99 110 121 47 11.9
sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2
O55739 but with -max_hsps 1
added (only getting one result per query-target pair, wanted behaviour)
user% blastp -query O55739.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481
sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0
sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2
All sequences against all
user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100
sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564
sp|P0C9J3|1101B_ASFWA sp|P32234|128UP_DROME 50.00 12 6 0 188 199 291 302 3.7 15.8
sp|P0C9J3|1101B_ASFWA sp|O55739|127L_IIV6 33.33 12 8 0 110 121 88 99 54 11.9
sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481
sp|O55739|127L_IIV6 sp|P32234|128UP_DROME 35.29 17 11 0 106 122 255 271 5.0 15.0
sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 33.33 12 8 0 88 99 110 121 47 11.9
sp|O55739|127L_IIV6 sp|P0C9J3|1101B_ASFWA 57.14 7 3 0 120 126 58 64 91 11.2
sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756
sp|P32234|128UP_DROME sp|P0C9J3|1101B_ASFWA 50.00 12 6 0 291 302 188 199 4.5 15.8
sp|P32234|128UP_DROME sp|O55739|127L_IIV6 35.29 17 11 0 255 271 106 122 7.7 15.0
All sequences against all and adding -max_hsps 1
flag
user% blastp -query sequences.fasta -db sequences -outfmt 6 -evalue 100 -max_hsps 1
sp|P0C9J3|1101B_ASFWA sp|P0C9J3|1101B_ASFWA 100.00 272 0 0 1 272 1 272 0.0 564
sp|O55739|127L_IIV6 sp|O55739|127L_IIV6 100.00 235 0 0 1 235 1 235 1e-177 481
sp|P32234|128UP_DROME sp|P32234|128UP_DROME 100.00 368 0 0 1 368 1 368 0.0 756
I would like to have the result as presented when running each sequence separately against the sequences-database, but all presented together by just running one command, i.e. 3 hits separately but an output-list of all 9. If I run sequences.fasta
as query against the database and with -max_hsps 1
flag, I only get the top hit for each query sequence. Maybe I'm misunderstanding the -max_hsps
flag? I could solve this by scripting and run a separate query for each sequence, but that would cause a loss of the multi-core ability?
I updated the original question with some more information. Hope that it is the correct procedure. Thanks for helping.