How to run diamond blastp to get all vs all similarity score between proteins?
0
0
Entering edit mode
12 months ago

Hi.

I have a .fa file representing multiple predicted proteins.

Example:

>NC_001341.1_1 # 397 # 714 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.314
MCSSSIISEKHLKKNIFQKKAKVQYKIKKNRRGQINENKCSINPNKKRSKKIKKLAKQKD
IQACINIGNRYVDVPIRPVSVADPDTPKETKEDKEKGCHFRNGIH*
>NC_001341.1_2 # 788 # 1009 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.306
MQCLISNEYHHNNNEHTSCINRINRNYRSNQRHHQGYNDLYDSINIIQGMLENLNASIVY
FTKDGKYKLIMTL*

Given this multiple predicted proteins, I want to use diamond blastp to compare pairwise all the proteins obtaining this way a matrix proteinxprotein with similarity scores between proteins.

This is my current code

diamond makedb  --in diamond_database/database.fa -d diamond_database/database.dmnd

 diamond blastp --sensitive -d diamond_database/database.dmnd -q CDS/all_protein.translations.fa -o 
 diamond_database/results.tab -k 1

Where I obtain something like this:

NC_028834.1_2   YP_009201735.1  100 171 0   0   1   171 1   171 3.64e-128   358
NC_028834.1_4   YP_009201736.1  100 110 0   0   1   110 1   110 7.49e-76    221
NC_028834.1_5   YP_009201737.1  100 178 0   0   1   178 1   178 7.81e-122   343
NC_028834.1_6   YP_009201738.1  100 46  0   0   1   46  1   46  1.11e-26    92.4
NC_028834.1_7   YP_009201739.1  100 184 0   0   1   184 1   184 1.09e-115   328
NC_028834.1_8   YP_009201740.1  100 494 0   0   1   494 1   494 0.0 1065

I am not sure how to interpret this but it seems to be comparing with a reference rather than pairwise.

protein blastp clustering diamond • 984 views
ADD COMMENT
0
Entering edit mode

How many proteins do you have, what do want to do with the data, and could you maybe use a standard phylogenetic approach (multiple sequence alignment -> ML or Bayesian phylogenetic analysis) ? None of the values from a local alignment is really good for generating a distance matrix and a phylogenetic analysis based on a distance matrix isn't state-of-the-art.

ADD REPLY
0
Entering edit mode

just to make sure, there are two different intermingled questions here,

for the start, diamond will take each sequence from the query and search it against each target sequence. So it is an all-to-all comparison but only the sequences that can generate hits will be reported. So perhaps you can reformulate what you mean by: "seems to only compare with a reference rather than pairwise"

then when you run a blastp alignment you get similarity scores of various kinds, whether those are meaningful or not is a different issue of course

ADD REPLY
0
Entering edit mode

I am not sure why you are predicting proteins for this RefSeq genome. You can download a protein fasta file by using the genome page here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000847085.1/

Click on Download button --> Select following options and download a protein fasta file.

download

You will get the following 4 proteins

>NP_039813.1 hypothetical protein L1_1 [Acholeplasma phage MV-L51]
MEYIEITVNKSQHFVNTDQIYNMLWSSAIMQCLISNEYHHNNNEHTSCINRINRNYRSNQRHHQGYNDLYDSINIIQGML
ENLNASIVYFTKDGKYKLIMTL
>NP_039814.1 hypothetical protein L1_2 [Acholeplasma phage MV-L51]
MKKILFSIIAIFTLLITIKGVHAYDTGSNVLTNNAYTVTYTSEPLGILAKQSKYQYHSMATQSAMIHWLLMVYIAVGYIY
DTSTSYSGIGQGYSPQVTQTDRAIYYQGQLKINTTNIKTSKRVKDIHNQIKQTQPYNYITFTTQKQTTSTKQYQQSKQQY
QSY
>NP_039815.1 hypothetical protein L1_3 [Acholeplasma phage MV-L51]
MKKYAKLIFIFIMIIFIMNYFIENADAFRLIIMEVLQVQSTKSFTIFIQLYSFKKVINRLCYIFRFLPFVSSLSLIYLIL
SWRIHHIVGHKMTKSLLDMSFSTWSLYLGSILLYTQVVWGLSLLIIAIIALYFILKFWWNNSQVTKDFTLYNIHIYGAKQ
RGKDLTTQTIYRRFHKEYNKRIKQLKNALKLVFK
>NP_039816.1 hypothetical protein L1_4 [Acholeplasma phage MV-L51]
MKFQKIHKYEGLDLFLSDVQTALPKQTSFTHWLPVFFGLAGQLYNMLIIINSQEFERPWVKLRGQQDKYYRAIKTTPFNK
SFIQKYVWNEIPILKNYIFIKIRSYSELQSAQQNVLPFNAIGLINEGSKHLVLTAGQATKEQFEATYGEIKERTIVINKK
HIMYDSRIFHKYFYGITAQGAIELYNKIESLTPPNFIKQNARGRKRYLLLRNKCPVGIIKSLVAVCQGLDVIKDGMALKL
LENSFCSKKHKAYDTLKALRITKNLN
ADD REPLY

Login before adding your answer.

Traffic: 2264 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6