Entering edit mode
11 months ago
Bioinformatics_begginner
▴
20
Hi.
I have a .fa file representing multiple predicted proteins.
Example:
>NC_001341.1_1 # 397 # 714 # 1 # ID=1_1;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.314
MCSSSIISEKHLKKNIFQKKAKVQYKIKKNRRGQINENKCSINPNKKRSKKIKKLAKQKD
IQACINIGNRYVDVPIRPVSVADPDTPKETKEDKEKGCHFRNGIH*
>NC_001341.1_2 # 788 # 1009 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.306
MQCLISNEYHHNNNEHTSCINRINRNYRSNQRHHQGYNDLYDSINIIQGMLENLNASIVY
FTKDGKYKLIMTL*
Given this multiple predicted proteins, I want to use diamond blastp to compare pairwise all the proteins obtaining this way a matrix proteinxprotein with similarity scores between proteins.
This is my current code
diamond makedb --in diamond_database/database.fa -d diamond_database/database.dmnd
diamond blastp --sensitive -d diamond_database/database.dmnd -q CDS/all_protein.translations.fa -o
diamond_database/results.tab -k 1
Where I obtain something like this:
NC_028834.1_2 YP_009201735.1 100 171 0 0 1 171 1 171 3.64e-128 358
NC_028834.1_4 YP_009201736.1 100 110 0 0 1 110 1 110 7.49e-76 221
NC_028834.1_5 YP_009201737.1 100 178 0 0 1 178 1 178 7.81e-122 343
NC_028834.1_6 YP_009201738.1 100 46 0 0 1 46 1 46 1.11e-26 92.4
NC_028834.1_7 YP_009201739.1 100 184 0 0 1 184 1 184 1.09e-115 328
NC_028834.1_8 YP_009201740.1 100 494 0 0 1 494 1 494 0.0 1065
I am not sure how to interpret this but it seems to be comparing with a reference rather than pairwise.
How many proteins do you have, what do want to do with the data, and could you maybe use a standard phylogenetic approach (multiple sequence alignment -> ML or Bayesian phylogenetic analysis) ? None of the values from a local alignment is really good for generating a distance matrix and a phylogenetic analysis based on a distance matrix isn't state-of-the-art.
just to make sure, there are two different intermingled questions here,
for the start, diamond will take each sequence from the query and search it against each target sequence. So it is an all-to-all comparison but only the sequences that can generate hits will be reported. So perhaps you can reformulate what you mean by: "seems to only compare with a reference rather than pairwise"
then when you run a blastp alignment you get similarity scores of various kinds, whether those are meaningful or not is a different issue of course
I am not sure why you are predicting proteins for this RefSeq genome. You can download a protein fasta file by using the genome page here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000847085.1/
Click on
Download
button --> Select following options and download a protein fasta file.You will get the following 4 proteins