Question

How to run diamond blastp to get all vs all similarity score between proteins?

0

Entering edit mode

12 months ago

Bioinformatics_begginner ▴ 20

Hi.

I have a .fa file representing multiple predicted proteins.

Example:

>NC_001341.1_1 # 397 # 714 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.314
MCSSSIISEKHLKKNIFQKKAKVQYKIKKNRRGQINENKCSINPNKKRSKKIKKLAKQKD
IQACINIGNRYVDVPIRPVSVADPDTPKETKEDKEKGCHFRNGIH*
>NC_001341.1_2 # 788 # 1009 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.306
MQCLISNEYHHNNNEHTSCINRINRNYRSNQRHHQGYNDLYDSINIIQGMLENLNASIVY
FTKDGKYKLIMTL*

Given this multiple predicted proteins, I want to use diamond blastp to compare pairwise all the proteins obtaining this way a matrix proteinxprotein with similarity scores between proteins.

This is my current code

diamond makedb  --in diamond_database/database.fa -d diamond_database/database.dmnd

 diamond blastp --sensitive -d diamond_database/database.dmnd -q CDS/all_protein.translations.fa -o 
 diamond_database/results.tab -k 1

Where I obtain something like this:

NC_028834.1_2   YP_009201735.1  100 171 0   0   1   171 1   171 3.64e-128   358
NC_028834.1_4   YP_009201736.1  100 110 0   0   1   110 1   110 7.49e-76    221
NC_028834.1_5   YP_009201737.1  100 178 0   0   1   178 1   178 7.81e-122   343
NC_028834.1_6   YP_009201738.1  100 46  0   0   1   46  1   46  1.11e-26    92.4
NC_028834.1_7   YP_009201739.1  100 184 0   0   1   184 1   184 1.09e-115   328
NC_028834.1_8   YP_009201740.1  100 494 0   0   1   494 1   494 0.0 1065

I am not sure how to interpret this but it seems to be comparing with a reference rather than pairwise.

protein blastp clustering diamond • 984 views

ADD COMMENT • link 12 months ago by Bioinformatics_begginner ▴ 20

0

Entering edit mode

How many proteins do you have, what do want to do with the data, and could you maybe use a standard phylogenetic approach (multiple sequence alignment -> ML or Bayesian phylogenetic analysis) ? None of the values from a local alignment is really good for generating a distance matrix and a phylogenetic analysis based on a distance matrix isn't state-of-the-art.

ADD REPLY • link 12 months ago by Michael 55k

0

Entering edit mode

just to make sure, there are two different intermingled questions here,

for the start, diamond will take each sequence from the query and search it against each target sequence. So it is an all-to-all comparison but only the sequences that can generate hits will be reported. So perhaps you can reformulate what you mean by: "seems to only compare with a reference rather than pairwise"

then when you run a blastp alignment you get similarity scores of various kinds, whether those are meaningful or not is a different issue of course

ADD REPLY • link 12 months ago by Istvan Albert 102k

0

Entering edit mode

I am not sure why you are predicting proteins for this RefSeq genome. You can download a protein fasta file by using the genome page here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000847085.1/

Click on Download button --> Select following options and download a protein fasta file.

download

You will get the following 4 proteins

>NP_039813.1 hypothetical protein L1_1 [Acholeplasma phage MV-L51]
MEYIEITVNKSQHFVNTDQIYNMLWSSAIMQCLISNEYHHNNNEHTSCINRINRNYRSNQRHHQGYNDLYDSINIIQGML
ENLNASIVYFTKDGKYKLIMTL
>NP_039814.1 hypothetical protein L1_2 [Acholeplasma phage MV-L51]
MKKILFSIIAIFTLLITIKGVHAYDTGSNVLTNNAYTVTYTSEPLGILAKQSKYQYHSMATQSAMIHWLLMVYIAVGYIY
DTSTSYSGIGQGYSPQVTQTDRAIYYQGQLKINTTNIKTSKRVKDIHNQIKQTQPYNYITFTTQKQTTSTKQYQQSKQQY
QSY
>NP_039815.1 hypothetical protein L1_3 [Acholeplasma phage MV-L51]
MKKYAKLIFIFIMIIFIMNYFIENADAFRLIIMEVLQVQSTKSFTIFIQLYSFKKVINRLCYIFRFLPFVSSLSLIYLIL
SWRIHHIVGHKMTKSLLDMSFSTWSLYLGSILLYTQVVWGLSLLIIAIIALYFILKFWWNNSQVTKDFTLYNIHIYGAKQ
RGKDLTTQTIYRRFHKEYNKRIKQLKNALKLVFK
>NP_039816.1 hypothetical protein L1_4 [Acholeplasma phage MV-L51]
MKFQKIHKYEGLDLFLSDVQTALPKQTSFTHWLPVFFGLAGQLYNMLIIINSQEFERPWVKLRGQQDKYYRAIKTTPFNK
SFIQKYVWNEIPILKNYIFIKIRSYSELQSAQQNVLPFNAIGLINEGSKHLVLTAGQATKEQFEATYGEIKERTIVINKK
HIMYDSRIFHKYFYGITAQGAIELYNKIESLTPPNFIKQNARGRKRYLLLRNKCPVGIIKSLVAVCQGLDVIKDGMALKL
LENSFCSKKHKAYDTLKALRITKNLN

ADD REPLY • link 12 months ago by GenoMax 148k