Hello everyone,
I have a dataset of 4000 protein sequences in FASTA format.
Example dataset:
>Acaryochloris_marina_peg_1945
MNILAVIPARYQSQRFPGKPLVMLDERPMVQWVYEAAKSCDFFQDAVVATDSDKIADCVKGFGGKVVMTRDDHLTGTDRVAEVAGFYDDMDVVVNVQGDQPFVTPEALEQLVRPYREGERPEMTTLGCPLDMDEDYASPNAVKVLCDRNGHALYFSRSPIPYFRTQGTVPVYHHLGLYAFRHDFLMQYSQLEPTPFETCEGLEQLRVLEYGYAIKVCQTQKAAIEINTPEDLVKAQLFIQQGMTS
>Acetivibrio_clariflavus_peg_3972
MRTGVIVQTRMGSSRLPGKVMIDICGKPVIEHVIDRLKMSKVLDDIIIATTTSVKDKIIVEQAKRNGVKWFCGSEEDVLSRYYYAARENRLSTVVRVTSDCPLIDPVILDEIVEFYKKNDYLLVTNAGNILEYRTYPRGLDVEVFSFDILEKAFYSAKKPYQREHVTPYIYETYENKIYYYKNNINLSKYRWTLDTEEDLKLISIIFNNFYYKYGRNFGFKDILKFIQSNPQLSKINEHIEQKKIG
>Acetivibrio_clariflavus_peg_4060
MKILFTICGRKGSKGVKSKNIKTFLGFPLAFYTASFIDLFIKRNNWVDSDIVLNTDSENLIDLFKNKLNMPIEIIERDPELAKDYVPKISVIKNCYDVMVERKKVSYDIVIDLDITSPLRRLRDLQSLIEKKLNSNADVVFSVTSARRNPYFNMVKKGENGYERVIESSFNARQEAPNVFDINGSMYAYSPDFLKSGKGLFDGICDIIEMRDTAVLDIDHENDFELMEVIAKYLYSSDNEYNCIRENINNILLKD
>Acetivibrio_saccincola_peg_0278
MKKVVAIIQARMGSTRLPGKVMKNLCGKTVLAHDIERVRQSKYIDEIVIATTKFKEDDIILREALENGAKVYRGSEDDVLRRYYEAAKENKADVIVRITSDCPLIDPFIVDEVIKVYLNSNYDLVTNAGIYPENRTYPRGLDVEVFSFDILKKAFQEAKEMYQREHVTPFIYENSKNIYYYKNDIDYSKYRWTLDTEEDYKLIEIIYKSFIKESIIFILMIF*SCLIICQSYLKLIKMCNKKLSVD
>Acetivibrio_saccincola_peg_0286
MKVSAIIQARTGSSRLPGKVLKEICGLPVLVHVINRVKQAKKVNEIIVATTDKASDEVIVDISEMENIKVFRGSEEDVLERYYKTALHFKSDIIVRITSDNPLTDHRLIDKIVENLIIHNADYSCNNMPSTYPYGLDCECFTFQVLEEAFFNAKDKYEREHVTPYIRENKELFKIVSIKGNDNYSHLRWTLDTQEDYNHIKEIFENLYHKNKYFLTEDIIQFLQENKRI
I want to build an amino acid identity scoring matrix comparing each protein with every other protein, and output the result as a table. The X-axis and the Y-axis are the same protein set, so each cell gives the similarity score for one pair of proteins.
I have tried a few of the tools available online (AAI-profiler and CompareM), but I couldn't get exactly what I am looking for. Is there a way to do this with a BLAST search?
Thank you
You should just script this yourself... R or Python would be easier than bash.
I would advise strongly against it. One could easily cook up a script using BLAST that does something, but it is most likely going to be detrimental (meaning you are worse off with such a solution than with none at all, e.g. because using BLAST for this task is the wrong approach). Instead of just starting to hack something, it is recommended:
I think you may have misread bash as blast, but based on OP's question... If you had to write your own protein distance function, it would take under 20 lines, assuming you don't want something super complicated. Alternatively, if you use an existing package you can directly generate the 'identity scoring matrix' that OP wants in essentially a single line:
That looks like a proper existing solution to me, possibly I misunderstood your intent. I wouldn't even call that scripting.
Btw, possibly we shouldn't use Jukes-Cantor for protein sequences? But rather JTT, LG, or WAG?
Hi, thank you for your suggestion. As far as I understand, DistanceMatrix gives the dissimilarity score between two sequences, but I want the similarity score.
Is there a parameter for that?
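On the similarity question: if the distances are uncorrected (the plain proportion of aligned positions that differ, with no Jukes-Cantor or other substitution-model correction applied), identity is simply 1 minus distance. Below is a minimal Python sketch of that conversion. It assumes the distance matrix has been exported as a square table of values between 0 and 1; the function name `distance_to_identity` is made up for illustration and is not part of any package.

```python
def distance_to_identity(dist, as_percent=True):
    """Convert uncorrected p-distances (0..1) to identity scores.

    Assumption: `dist` holds plain proportions of mismatched positions.
    Model-corrected distances (e.g. Jukes-Cantor) can exceed 1 and
    cannot be converted this way, so such values are rejected.
    """
    scale = 100.0 if as_percent else 1.0
    identity = []
    for row in dist:
        for d in row:
            if not 0.0 <= d <= 1.0:
                raise ValueError(f"not an uncorrected p-distance: {d}")
        identity.append([round((1.0 - d) * scale, 2) for d in row])
    return identity


if __name__ == "__main__":
    # Toy 3x3 distance matrix (made-up numbers, not from the dataset above).
    toy = [
        [0.00, 0.25, 0.60],
        [0.25, 0.00, 0.50],
        [0.60, 0.50, 0.00],
    ]
    for row in distance_to_identity(toy):
        print(row)
```

If the distances were computed with a correction, recompute them without it before converting; otherwise the result is not a percent identity.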
Do you want an amino acid identity scoring matrix or an amino acid similarity scoring matrix?
Hi, I want to know the AAI (amino acid identity) score between the sequences.
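For reference, here is a rough Python sketch of what an all-vs-all identity table involves: read the FASTA records, globally align every pair, and count identical columns. Everything in it is an assumption made for illustration: the helper names (`read_fasta`, `global_identity`, `identity_matrix`), the simple match/mismatch/gap scores (a real protein alignment would normally use a substitution matrix such as BLOSUM62), and the choice to define identity as identical columns divided by alignment length including gaps (other definitions divide by the shorter sequence length). It is pure Python, so it only illustrates the definition; for 4000 sequences (about 8 million pairs) an optimized alignment tool is needed.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity from a Needleman-Wunsch global alignment.

    Scores are illustrative placeholders, not a protein substitution model.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns and all columns.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return 100.0 * same / cols if cols else 0.0


def identity_matrix(records):
    """Return (names, symmetric matrix of percent identities)."""
    names = [name for name, _ in records]
    seqs = [seq for _, seq in records]
    k = len(seqs)
    mat = [[100.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            mat[i][j] = mat[j][i] = global_identity(seqs[i], seqs[j])
    return names, mat


if __name__ == "__main__":
    # Toy input (made-up short sequences); replace with open("proteins.fasta").
    toy = io.StringIO(">seqA\nMKILFTICGRKG\n>seqB\nMKVLFTICGKG\n>seqC\nMNILAVIPARYQ\n")
    names, mat = identity_matrix(list(read_fasta(toy)))
    print("\t".join([""] + names))
    for name, row in zip(names, mat):
        print("\t".join([name] + [f"{v:.1f}" for v in row]))
```

The printed table has the same protein names on both axes. Note that one of the example sequences above contains a `*` (stop) character, which you may want to strip or split on before aligning.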